8/10/2019 embedded system architecture by Ralf Niemann
1/130
Hardware/Software Codesign
of Embedded Systems
Petru Eles and Zebo Peng
Embedded Systems Laboratory (ESLAB), Linköping University
Embedded Tutorial
Lecture Contents
= Introduction and basic issues.
= Architectures and platforms.
= Analysis, co-simulation, and design space exploration.
Prof. Z. Peng, ESLAB/LiTH
Introduction
= Codesign of embedded systems
= Definition and motivation
= The design flows
= System level design issues
Traditional Design Flow
[Figure: flow chart with the boxes Informal System Specification; Early, Manual Partitioning; HW Specification; SW Specification.]
Design Time
[Figure: design time lines (axis: time) for Traditional Design and for HW/SW Codesign. Both consist of Specification & Partitioning, HW Design & Simulation, SW Design & Simulation, and Integration & Test; the codesign flow adds Co-sim. & Co-verif. and shows a reduced TTM.]
HW/SW Codesign
= The concurrent design of hardware and software elements, supporting explicit hardware/software trade-offs.
0 Co-specification to create a common specification that describes both hardware and software elements.
Why Codesign?
= Reduce time-to-market.
= Achieve better designs:
0 More design alternatives can be explored.
0 Better solutions can be found by advanced optimization techniques.
= To meet strict design constraints, such as:
0 Timing or performance constraints.
0 Power dissipation.
0 Physical constraints, e.g., size, weight, etc.
0 Safety and reliability constraints.
0 Cost constraints.
= Codesign is also made possible by the advances in design methodologies and tools.
Vertical Codesign
= Instruction set processor design, for both general-purpose systems and ASIPs (Application Specific Instruction Processors).
To determine how big the
hardware engine you need to
Codesign of Processors
= General-Purpose Processors
0 Architectural support for operating systems.
0 Cache design and tuning (e.g., selection of cache size and control schemes).
0 Pipeline control design (control mechanisms, compiler design).
= ASIPs
0 Customization of instruction sets and specific resources (e.g., accelerator and coprocessor).
0 Design of register files, busses and interconnections.
0 Development of a specific compiler.
Horizontal Codesign
= Some of the system functionality is implemented in software running on programmable CPUs, while other functions are implemented in hardware.
= Typical for design of embedded systems.
What is an Embedded System?
= There are many different definitions!
0 A special-purpose computer system that is used for a particular task.
0 A computer-based system embedded in real-life machines. Though computer based, it does not have the usual keyboard and monitor. The processor and related circuitry are configured to do a specific task.
= Some highlight what it is (not) used for:
0 Any device which includes a programmable component but itself is not intended to be a general purpose computer.
= Some focus on what it is built from:
0 A collection of programmable parts surrounded by ASICs and other standard components, that interact continuously with an environment through sensors and actuators.
Characteristics of an Embedded System
= Dedicated (not general purpose).
0 One or several applications known at design-time.
= Contains a programmable component.
0 But usually not programmable by the end-user.
= Interacts (continuously) with the environment:
0 Real-time behavior.
Embedded Systems
[Figure: pie chart of microprocessor market shares in 1999, general purpose systems vs. embedded systems; the chart labels read 1% and 99%.]
Embedded Controllers
[Figure: embedded controller built from CPU, RAM, ROM, ASIC, I/O interface and network interface, connected to sensors and actuators.]
Distributed Embedded Systems
[Figure: two groups of ECUs, each group attached to a gateway.]
Time and Power Constraints
= Time constraints:
0 They have to perform in real-time: if data are not ready by a certain deadline, the system fails to perform correctly.
0 Hard deadline: failure to meet it leads to major hazards.
0 Soft deadline: failure to meet it can be tolerated, but quality of service is reduced.
= Power constraints
Safety Critical Requirements
= Embedded systems are often used in life-critical applications.
0 Avionics, automotive electronics, nuclear plants, medical applications, military applications, etc.
= Reliability and safety are major requirements.
= To guarantee correctness during design:
0 Formal verification: mathematics-based methods to verify certain properties of the designed system.
0 Automatic synthesis: certain design steps are automatically performed by design tools (correctness by construction).
Short Time to Market
= In highly competitive markets it is critical to catch the market window:
0 A short delay with the product on the market can have catastrophic financial consequences (even if the quality of the product is excellent).
= Design time has to be reduced!
The ES Design Challenges
= Increasing application complexity (e.g., automotive).
= Heterogeneous architecture (HW, SW, network, mechatronics, etc.).
= Stringent time and power constraints.
= Low cost requirement.
= Short time to market.
= Safety and reliability (e.g., very long life-time).
= In order to achieve all these requirements, systems have to be highly optimized.
= Both hardware and software aspects have to be considered simultaneously!
Current Design Practice
1. Start from some informal specification and a set of constraints (time, power, and cost constraints).
2. Generate a more formal specification, based on some modeling concept (FSM, data-flow, etc.), using Matlab, Statecharts, SystemC, C, UML, or VHDL.
3. Simulate the model in order to check its functionality. The model is modified, if needed.
The Consequences
= Delays in the design process:
0 Increased design cost.
0 Delays in time to market, hence a missed market window.
= High cost due to many iterations with implementation and prototyping.
= Bad design decisions taken under time pressure:
0 Low quality.
0 High cost.
= The lesson: we need to explore more design alternatives in an efficient manner.
0 At the system level!
System-Level Design
[Figure: system-level design flow with the boxes Informal Specification, Constraints; Modeling; System Model (with Functional Simulation and Formal Verification); Arch. Selection; System Architecture; Mapping.]
The Improved Design Flow
= Several design alternatives are evaluated before going down to the lower-level design.
0 This is performed as part of the design space exploration process.
0 Different architectures, mappings and schedules are explored before the actual implementation and prototyping.
= We get highly optimized solutions in a short time.
0 There is a good chance that design iterations at the lower level, including prototyping, can be avoided.
Additional Improvements
= Formal verification
0 It is impossible to do an exhaustive simulation.
0 Especially for safety critical systems, formal verification is needed.
= Simulation
0 Used not only for functional validation.
0 Should also be used after mapping and scheduling in order to check, for example, timing properties.
The Lower-Level Issues
= Software generation:
0 Encoding in an implementation language (C, C++, assembler).
0 Compiling (this can include particular optimizations for application specific processors, DSPs, etc.).
0 Generation of a real-time kernel or adapting to an existing operating system.
= Hardware synthesis:
0 Encoding in an HDL (VHDL or Verilog).
0 Successive synthesis steps: high-level, register-transfer level, logic-level synthesis.
= Hardware/software integration:
0 The software is run together with the hardware model (co-simulation).
= Prototyping:
0 A prototype of the hardware is constructed and the software is executed on the target architecture.
Lower-Level Design
There are established CAD tools on the market which automatically perform many of the low level tasks:
= Code generators (software model to C, hardware model to VHDL).
= Compilers.
= Hardware synthesis tools
Focus on System-Level Design
= System-level design decisions have a huge influence on the quality of the final implementation.
= Very few commercial tools are available.
= Mostly experimental and academic tools are available.
= Huge efforts and investments are currently made in order to develop tools and methodologies for system level design.
= Ad-hoc solutions are less and less acceptable.
= It is the system level we are mainly interested in, in this course!
Concluding Remarks
= Codesign provides the capability to make explicit and efficient hardware/software trade-offs.
= Codesign of embedded systems has many advantages and challenges.
Analysis, Co-Simulation
and Design Space Exploration
Zebo Peng
Embedded Systems Laboratory (ESLAB), Linköping University
Outline
= Static analysis techniques
= Design space exploration
The Design Space
= Very large due to many solution parameters:
0 architectures and components
0 hardware/software partitioning
0 mapping and scheduling
0 operating systems and global control
0 communication synthesis
[Figure: heterogeneous SoC combining hardware and software: microprocessor, DSP, ASIC, embedded memory, network, sensor, analog circuit, high-speed electronics. Image sources: S3, Stratus Computers.]
Design Space Exploration
What is needed in order to explore the complex design space and find a good solution:
= Exploration at the higher levels of abstraction.
= Development of high-level analysis and estimation techniques.
= Employment of very fast exploration algorithms
The Optimization Problem
The majority of design space exploration tasks can be viewed as optimization problems:
To find
0 the architecture (type and number of processors, memory modules, and communication blocks, as well as their interconnections),
0 the mapping of functionality onto the architecture components, and
0 the schedules of basic functions and communications,
such that a cost function (in terms of implementation cost, performance, power, etc.) is minimized and a set of constraints is satisfied.
The System Partitioning Problem
[Figure: graph with weighted nodes and edges, illustrating two-way partitioning.]
Hardware/Software Partitioning
Input: Implementation independent system specification consisting of interacting processes (e.g., VHDL).
Output: Two sets of processes, assigned for hardware and software implementation respectively.
Target architecture:
- Microprocessors
- ASICs
- Shared memories
Hardware/Software Partitioning
Assumptions:
= Microprocessor and ASIC working in parallel;
= Reducing the amount of communication between the microprocessor and hardware improves the overall performance.
Objectives:
Hardware/Software Partitioning
= Quantitative values can be derived via simulation, profiling, or static analysis of the specification.
Ex.
0 computation load (CL): number of operations executed by a basic region or process of the specification.
0 communication intensity (CI): total number of communication operations on a channel between two processes.
= Performance improvement based on:
0 Placing computation intensive processes into hardware.
0 Increasing parallelism.
0 Reducing inter-domain communication.
Process Graph Formulation
= nodes correspond to processes, which could be processes or basic blocks in the original specification (e.g., VHDL).
= node weights reflect the degree of suitability for hardware implementation of the corresponding process:
the computation load of the process;
Process Graph Formulation
= The Graph Partitioning Problem:
To partition the process graph into two groups such that the sum of the weights of the cut edges will be minimal, subject to a set of constraints:
Ex.
$\sum_{i \in H} H\_cost_i \le Max^{H}$   (physical limitation of silicon area)
$W^{N}_{i} \ge Lim_1 \Rightarrow i \in Hw$   (implement a node in HW when it is appropriate)
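The objective and the area-style constraint can be made concrete with a minimal sketch. It is not the authors' tool: the process names, edge weights, `hw_cost` values and the `max_hw` budget are made-up illustration data, and `cut_weight` and `feasible` are hypothetical helper names.

```python
def cut_weight(edges, hw):
    """Sum the weights of edges with exactly one endpoint in the hardware set."""
    return sum(w for (a, b), w in edges.items() if (a in hw) != (b in hw))

def feasible(hw, hw_cost, max_hw):
    """Area-style constraint: total hardware cost must stay within the budget."""
    return sum(hw_cost[n] for n in hw) <= max_hw

# Hypothetical process graph: edge weights model communication intensity.
edges = {("p1", "p2"): 5, ("p2", "p3"): 8, ("p3", "p4"): 2, ("p1", "p4"): 3}
hw_cost = {"p1": 4, "p2": 6, "p3": 5, "p4": 3}

hw = {"p2", "p3"}                       # candidate hardware partition
cost = cut_weight(edges, hw)            # edges p1-p2 and p3-p4 are cut
ok = feasible(hw, hw_cost, max_hw=12)   # 6 + 5 fits into a budget of 12
```

A partitioning heuristic would call these two functions repeatedly while it searches for a feasible partition with a small cut weight.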
Features of CO Problems
= Most combinatorial optimization (CO) problems in digital system design, e.g., system partitioning with constraints, are NP-complete.
= For all known exact algorithms, the time needed to solve an NP-complete problem grows exponentially with the problem size n in the worst case.
Features of CO Problems
= Many CO problems can be formulated as an Integer Linear Programming (ILP) problem, and solved by an ILP solver.
= It is inherently more difficult to solve an ILP problem than the corresponding Linear Programming problem.
= The size of problem that can be solved successfully by ILP algorithms is an order of magnitude smaller than the size of LP problems that can be easily solved.
Heuristics
= A heuristic seeks near-optimal solutions at a reasonable computational cost without being able to guarantee either optimality or feasibility.
= Motivations:
0 Many exact algorithms involve a huge amount of computation effort.
0 The decision variables have frequently complicated
Heuristic Approaches to CO

                          Problem specific            Generic methods
Constructive              Clustering                  Branch and bound
                          List scheduling             Divide and conquer
                          Left-edge algorithm
Transformational          Kernighan-Lin algorithm     Neighborhood search
(iterative improvement)                               Simulated annealing
                                                      Tabu search
                                                      Genetic algorithms
                                                      (meta-heuristics)
Clustering for System Partitioning
= Each node initially belongs to its own cluster, and clusters are then gradually merged until the desired partitioning is found.
= The merge operation is selected based on local information (closeness metrics), rather than a global view of the whole system.
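A minimal sketch of this bottom-up merging, under stated assumptions: the closeness metric used here (total edge weight between two clusters) is only one possible choice, and the function name `cluster`, the graph and the target cluster count `k` are made up for illustration.

```python
def cluster(nodes, edges, k):
    """Greedy bottom-up clustering: merge the two closest clusters until k remain."""
    clusters = [frozenset([n]) for n in nodes]   # every node starts in its own cluster

    def closeness(c1, c2):
        # Local metric (assumed here): total edge weight between the two clusters.
        return sum(w for (a, b), w in edges.items()
                   if (a in c1 and b in c2) or (a in c2 and b in c1))

    while len(clusters) > k:
        pairs = [(closeness(c1, c2), i, j)
                 for i, c1 in enumerate(clusters)
                 for j, c2 in enumerate(clusters) if i < j]
        _, i, j = max(pairs, key=lambda p: p[0])   # closest pair; first one wins ties
        merged = clusters[i] | clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters

nodes = ["a", "b", "c", "d"]
edges = {("a", "b"): 9, ("b", "c"): 1, ("c", "d"): 7, ("a", "d"): 2}
result = cluster(nodes, edges, k=2)   # a and b merge first, then c and d
```

Because each merge looks only at pairwise closeness, the result can differ from the globally best partition, which is the limitation the slide points out.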
The Kernighan-Lin Algorithm (KL)
= A graph is partitioned into two clusters of arbitrary size, by minimizing a given objective function.
= KL is based on an iterative partitioning strategy:
0 The algorithm starts with two arbitrary clusters C1 and C2.
0 The partitioning is then iteratively improved by moving nodes between the clusters.
0 At each iteration, the node which produces the minimal value of the cost function is moved; this value can, however, be greater than the value before moving the node.
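The move-based strategy described above can be sketched as a single improvement pass. This follows the slide's description (single-node moves), not the pair swapping of the original Kernighan-Lin paper. Three further assumptions are not in the slides: each node moves at most once per pass, the best partition seen is remembered, and both clusters must stay non-empty. The cut weight serves as the objective, and the graph is invented.

```python
def cut_weight(edges, c1):
    """Objective (assumed here): total weight of edges crossing the partition."""
    return sum(w for (a, b), w in edges.items() if (a in c1) != (b in c1))

def kl_style_pass(nodes, edges, c1):
    current = set(c1)
    best, best_cost = set(current), cut_weight(edges, current)
    unlocked = set(nodes)
    while unlocked:
        # Only consider moves that keep both clusters non-empty.
        candidates = [n for n in sorted(unlocked)
                      if 0 < len(current ^ {n}) < len(nodes)]
        if not candidates:
            break
        # Move the node giving the lowest cost, even if that cost is higher than before.
        node = min(candidates, key=lambda n: cut_weight(edges, current ^ {n}))
        current ^= {node}
        unlocked.remove(node)              # each node moves at most once per pass
        cost = cut_weight(edges, current)
        if cost < best_cost:
            best, best_cost = set(current), cost
    return best, best_cost

# Two tightly connected triangles joined by one weak edge (hypothetical data).
nodes = ["a", "b", "c", "d", "e", "f"]
edges = {("a", "b"): 5, ("b", "c"): 5, ("a", "c"): 5,
         ("d", "e"): 5, ("e", "f"): 5, ("d", "f"): 5, ("c", "d"): 1}
best, best_cost = kl_style_pass(nodes, edges, {"a", "d", "e"})
```

Accepting moves that temporarily raise the cost is what lets the pass leave a poor starting partition; remembering the best partition seen ensures that such uphill moves never worsen the returned result.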
Branch-and-Bound
= Traverse an implicit tree to find the best leaf (solution).
4-City TSP, distance matrix (upper triangle shown):
      0   1   2   3
  0   0   3   6  41
  1       0  40   5
  2           0   4
  3               0
Branch-and-Bound Ex
[Figure: search tree over partial tours starting at city 0; inner nodes carry a lower bound L on the tour length, leaves carry the length of the complete tour.]
  {0}  L >= 0
    {0,1}  L >= 3
      {0,1,2}  L >= 43   {0,1,2,3}  L = 88
      {0,1,3}  L >= 8    {0,1,3,2}  L = 18
    {0,2}  L >= 6
      {0,2,1}  L >= 46   {0,2,1,3}  L = 92
      {0,2,3}  L >= 10   {0,2,3,1}  L = 18
    {0,3}  L >= 41
      {0,3,1}  L >= 46   {0,3,1,2}  L = 92
      {0,3,2}  L >= 45   {0,3,2,1}  L = 88
= Lower bound on the cost function.
= Search strategy
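A minimal sketch of branch-and-bound on the 4-city example. The distances are the ones recoverable from the example's bounds and tour lengths, assuming a symmetric matrix. The lower bound used here is simply the length of the partial tour, and the search strategy is depth-first; both are simple choices made for illustration, and the function name is hypothetical.

```python
D = [[0, 3, 6, 41],
     [3, 0, 40, 5],
     [6, 40, 0, 4],
     [41, 5, 4, 0]]

def tsp_branch_and_bound(d):
    n = len(d)
    best = {"length": float("inf"), "tour": None}

    def visit(tour, length):
        # Lower bound: length of the partial tour (edge weights are non-negative).
        if length >= best["length"]:
            return                      # prune: cannot beat the best tour so far
        if len(tour) == n:
            total = length + d[tour[-1]][tour[0]]   # close the tour
            if total < best["length"]:
                best["length"], best["tour"] = total, tour
            return
        for city in range(n):
            if city not in tour:
                visit(tour + [city], length + d[tour[-1]][city])

    visit([0], 0)                       # depth-first search from city 0
    return best["length"], best["tour"]

length, tour = tsp_branch_and_bound(D)  # shortest tour has length 18
```

Once the tour of length 18 is known, the subtrees rooted at {0,2,1} (bound 46) and {0,3} (bound 41) are cut off without being expanded.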
Neighborhood Search Method
= Step 1 (Initialization)
(A) Select a starting solution x_now ∈ X.
(B) x_best = x_now, best_cost = c(x_best).
= Step 2 (Choice and termination)
Choose a solution x_next ∈ N(x_now).
If no solution can be selected or the terminating criteria apply, then the method stops.
Neighborhood Search Method
= The neighborhood search method is very attractive for many CO problems, as they have a natural neighborhood structure which can be easily defined and evaluated.
0 Ex. Graph partitioning: swapping two nodes.
[Figure: the weighted graph from the two-way partitioning example, shown twice to illustrate a swap of two nodes.]
The Descent Method
= Step 1 (Initialization)
= Step 2 (Choice and termination)
Choose x_next ∈ N(x_now) such that c(x_next) < c(x_now), and terminate if no such x_next can be found.
= Step 3 (Update)
The descent process can easily get stuck at a local optimum.
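A minimal sketch of the descent method on a toy one-dimensional landscape. The slide allows any improving neighbor; picking the best improving one (steepest descent) is a choice made here, and the cost table and the neighborhood (adjacent indices) are invented for illustration.

```python
def descent(x, neighbors, cost):
    """Move to a strictly better neighbor until none exists (a local optimum)."""
    while True:
        better = [y for y in neighbors(x) if cost(y) < cost(x)]
        if not better:
            return x
        x = min(better, key=cost)       # steepest descent: best improving neighbor

# Toy landscape (hypothetical): index 1 is a local minimum, index 7 the global one.
costs = [5, 3, 4, 6, 7, 2, 1, 0, 3, 8]

def cost(i):
    return costs[i]

def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(costs)]

stuck = descent(0, neighbors, cost)     # stops at index 1, a local optimum
found = descent(4, neighbors, cost)     # reaches index 7, the global optimum
```

The two runs differ only in their starting point, which shows why the outcome of pure descent depends on the initial solution.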
Dealing with Local Optimality
= Enlarge the neighborhood.
= Start with different initial solutions.
= To allow uphill moves:
0 Simulated annealing
0 Tabu search
[Figure: cost plotted over the solution space X.]
The SA Algorithm
Select an initial solution x_now ∈ X;
Select an initial temperature t > 0;
Select a temperature reduction function α;
Repeat
  Repeat
    Randomly select x_next ∈ N(x_now);
    δ = cost(x_next) - cost(x_now);
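The pseudocode breaks off at this point in the source. The following minimal sketch completes the loop with the standard Metropolis acceptance rule (always accept downhill moves, accept uphill moves with probability exp(-delta/t)); that rule, the geometric cooling `t = alpha * t`, the stopping temperature `t_min` and all parameter values are assumptions, not taken from the slides.

```python
import math
import random

def simulated_annealing(x, neighbors, cost, t=10.0, alpha=0.9,
                        n_rep=20, t_min=1e-3, seed=0):
    rng = random.Random(seed)
    best = x
    while t > t_min:                        # outer loop: cooling
        for _ in range(n_rep):              # inner loop: moves at a fixed temperature
            x_next = rng.choice(neighbors(x))
            delta = cost(x_next) - cost(x)
            # Downhill moves are always accepted; uphill ones with probability exp(-delta/t).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                x = x_next
            if cost(x) < cost(best):
                best = x                    # remember the best solution seen so far
        t = alpha * t                       # geometric temperature reduction (assumed)
    return best

# Toy landscape (hypothetical) with a local minimum at index 1.
costs = [5, 3, 4, 6, 7, 2, 1, 0, 3, 8]

def cost(i):
    return costs[i]

def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(costs)]

result = simulated_annealing(0, neighbors, cost)
```

At high temperature almost every move is accepted, so the search can climb out of the local minimum at index 1; as t shrinks, the behavior approaches pure descent.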
A HW/SW Partitioning Example
[Figure: cost function value (axis from 35000 to 75000) versus number of iterations (axis from 0 to 1400); optimum at iteration 1006.]
Analysis Techniques
= Analysis and simulation techniques are essential for hardware/software codesign:
0 To guide the design space exploration.
0 To provide feedback to the human designers.
0 To support design validation.
= Selection of analysis/simulation techniques ...
Performance Metrics
= Extreme case performance
0 Worst-case execution time
0 Best-case execution time
= Average case performance
= Probabilistic performance
0 Used in soft real-time applications
0 To accurately handle the variable execution time of tasks, which may be due to
- Application characteristics (e.g., data dependent loops);
- Architectural factors (e.g., cache misses);
- External factors (e.g., network load); or
- Insufficient knowledge.
0 To guarantee a high probability of meeting timing constraints.
Simulation-based Techniques
= Software: running the compiled program on the simulated target architecture.
= Hardware: building a simulation model of the hardware and executing it to collect information.
A very large number of inputs should be used
Program Path Analysis
= To determine what sequence of instructions will be executed in the worst case scenario.
A basic block is composed of instructions in a straight line
= Let us first assume that each instruction takes a fixed time to execute
Program Path Analysis
= Infeasible paths can be eliminated by data flow analysis and path information provided by the programmer.
= The number of feasible paths is typically exponential with the program size.
Efficient methods are needed to avoid
ILP Formulation
Let
x_i be the number of times a basic block B_i is executed;
c_i be the execution time of the basic block B_i, which is assumed to be a constant.
The total execution time of the program for a particular execution is:
$\sum_{i=1}^{N} c_i x_i$
[Figure: control flow graph with basic block costs C1 to C7, annotated with execution counts.]
Ex. $C_1 + C_2 + C_4 + 11\,C_5 + 10\,C_6 + C_7$
ILP Formulation (Cont'd)
The estimated WCET of the program is:
$\max \sum_{i=1}^{N} c_i x_i$
subject to a set of constraints $A x \le b$.
An Example
/* k >= 0 */
s = k;
while (k < 10) {
    if (ok)
        j++;
    else {
        j = 0;
        ok = true;
    }
    k++;
}
r = j;
[Figure: control flow graph of the example; basic blocks B1, B2, ... carry execution counts x1, x2, ... and the edges carry counts d1, d2, ...]
Constraints II
= Functionality constraints:
Loop bound information
$0 \cdot x_1 \le x_3 \le 10 \cdot x_1$
Path information
$x_5 \le 1 \cdot x_1$

/* k >= 0 */
s = k;                  /* x1 */
while (k < 10) {        /* x2 */
    if (ok)             /* x3 */
        j++;            /* x4 */
    else {
        j = 0;          /* x5 */
        ok = true;
    }
    k++;                /* x6 */
}
r = j;                  /* x7 */
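A minimal sketch of the WCET maximization for this example. It swaps the ILP solver for plain enumeration, which is feasible only because the example is tiny. The execution times in `c` are invented, and the structural constraints (x1 = 1, x3 = x4 + x5, x6 = x3, x2 = x1 + x6, x7 = x1) are read off the control flow of the code above as an assumption, since the slide that states them is not part of this text.

```python
# Hypothetical execution times c1..c7 (cycles) for the basic blocks x1..x7.
c = {1: 2, 2: 3, 3: 2, 4: 1, 5: 4, 6: 1, 7: 1}

def wcet(c, loop_bound=10):
    best_time, best_counts = -1, None
    x1 = 1                                        # the fragment is entered once
    for x3 in range(0, loop_bound * x1 + 1):      # loop bound: 0 <= x3 <= 10*x1
        for x5 in range(0, min(1 * x1, x3) + 1):  # path information: x5 <= 1*x1
            x = {1: x1, 3: x3, 5: x5}
            x[4] = x3 - x5                        # if-test splits into then/else: x3 = x4 + x5
            x[6] = x3                             # k++ runs once per loop iteration
            x[2] = x1 + x[6]                      # loop test: once on entry, once per back edge
            x[7] = x1                             # r = j runs once on exit
            time = sum(c[i] * x[i] for i in c)    # objective: sum of c_i * x_i
            if time > best_time:
                best_time, best_counts = time, x
    return best_time, best_counts

bound, counts = wcet(c)
```

With these costs the maximum is reached for ten loop iterations with the else branch taken once; without the path constraint on x5 the estimate would be more pessimistic, because the expensive else branch would be counted in every iteration.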
Remarks on Performance Analysis
= One of the main issues of hardware/software codesign is estimation and analysis.
= Analysis of average and probabilistic performance can be done by simulation.
= Worst case execution time analysis can only be efficiently done by using analysis techniques.
Simulation
= Applied usually directly to the design descriptions, e.g. VHDL.
= Can be used at different levels of abstraction:
0 System
0 Algorithmic
0 Register-transfer
0 Logic
0 Gate
0 Switch and circuit
Co-Simulation
= How are the hardware and software components simulated at the same time?
Problems:
= Different simulation platforms are used;
= Software runs fast while hardware simulation is
Approaches to Co-Simulation 1
= Gate-level model of the processor
0 Gate level simulation of the processor is very slow (tens of clock cycles/sec).
Ex. 10 cycles/sec, 1 GHz processor: 100 million seconds (3.2 years) are needed to simulate one second of real time.
0 This provides a very accurate solution and is very simple from the co-simulation point of view.
[Figure: co-simulation framework in which the SW runs on a gate-level processor model (VHDL) next to the ASIC model (VHDL), both executed by VHDL simulation.]
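The slowdown figure in the example above can be checked with a few lines of arithmetic. The function name is hypothetical; the numbers are the ones from the example.

```python
def simulation_seconds(real_seconds, clock_hz, simulated_cycles_per_sec):
    """Wall-clock time needed to simulate `real_seconds` of target execution."""
    return real_seconds * clock_hz / simulated_cycles_per_sec

seconds = simulation_seconds(1, 1e9, 10)   # 1 GHz target, 10 cycles/sec simulator
years = seconds / (365 * 24 * 3600)        # about 3.2 years
```

The same function shows how much a faster processor model helps: every factor of ten in simulated cycles per second cuts the simulation time by a factor of ten.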
Approaches to Co-Simulation 2
= Instruction-set architecture models
[Figure: the SW runs on an ISA model (C program) executed as a program on the host, next to the ASIC model (VHDL) executed by VHDL simulation.]
Hardware/Software Codesign Arch & Platf - 1
Petru Eles, IDA, LiTH
Architectures and Platforms
1. Architecture Selection: The Basic Trade-Offs
2. General Purpose vs. Application-Specific Processors
3. Processor Specialisation
4. ASIP Design Flow
5. Specialisation of a VLIW ASIP
6. Tool Support for Processor Specialisation
7. Application Specific Platforms
8. IP-Based Design (Design Reuse)
9. Reconfigurable Systems
Remember the Design Flow
[Figure: system-level design flow with the boxes Informal Specification, Constraints; Modeling; System model (Functional Simulation, Formal Verification); Arch. Selection; System architecture; Mapping; Estimation; Scheduling; Mapped and scheduled model (Simulation, Formal Verification, with OK / not OK feedback); Softw. model and Hardw. model; Softw. Generation and Hardw. Synthesis; Simulation.]
Architecture Selection and Mapping
Select the underlying hardware structure on which to run the modelled system.
Map the functionality captured by the system model over the components of the selected architecture.
Functionality includes processing and communication.
Architecture Selection
General Purpose vs. Application Specific:
- Build a customised architecture strictly optimised for the particular application.
- Use a general purpose, existing platform and map the application on it.
- or something in-between
Software vs. Hardware:
- Use programmable processors running software.
- Use dedicated electronics (fixed or reconfigurable).
- or both
Architecture Selection (cont'd)
The trade-offs:
Performance (high speed, low power consumption)
Flexibility (how easy it is to upgrade or modify)
[Figure: two copies of the design space spanned by Application specific / General purpose and Hardware / Software, with Reconfigurable hardware in between, each graded from high to low for one of the two trade-offs.]
Architecture Selection (cont'd)
[Figure: energy consumed versus flexibility (both axes: low, med., high) for ASIC, FPGA, ASIP and GP proc.; the steps are labelled "order of magnitude".]
General Purpose vs. Application Specific Processors
Both GP processors and ASIPs (application specific instruction set processors) can be RISCs, CISCs, DSPs, microcontrollers, etc.
- One could look at DSPs and microcontrollers as being specific for DSP and simple control applications respectively.
- An application specific DSP or microcontroller is, however, more specialised than just for DSP or control applications.
GP processors
- Neither instruction set nor microarchitecture or memory system are customised for a particular application or family of applications.
ASIPs
- Instruction set, microarchitecture and/or memory system are customised for an application or family of applications.
- What results is better performance and reduced power consumption.
What Makes an ASIP Specific?
What can we specialise in a processor?
Instruction set (IS) specialisation
Exclude instructions which are not used
- reduces instruction word length (fewer bits needed for encoding);
- keeps controller and data path simple.
Introduce instructions, even exotic ones, which are specific to the application: combinations of arithmetic instructions (multiply-accumulate), small algorithms (encoding/decoding, filter), vector operations, string manipulation or string matching, pixel operations, etc.
- reduces code size, and with it memory size, memory bandwidth, power consumption, and execution time.
What Makes an ASIP Specific?
Function unit and data path specialisation
Once an application specific IS is defined, this IS can be implemented using a more or less specific data path and more or less specific function units.
Adaptation of word length.
Adaptation of register number.
Adaptation of functional units
- Highly specialised functional units can be introduced for string matching and manipulation, pixel operation, arithmetics, and even complex units to perform certain sequences of computations (co-processors).
What Makes an ASIP Specific?
Interconnect specialization
Interconnect of functional modules and registers.
Interconnect to memory and cache.
- How many internal buses?
- What kind of protocol?
- Additional connections increase the potential of parallelism.
Control specialisation
Centralised control or distributed (globally asynchronous)?
Pipelining?
Out of order execution?
Hardwired or microprogrammed?
ASIP Design Flow
(It can be seen as a part of the big design flow shown under "Remember the Design Flow")
[Figure: design loop connecting Algorithm(s), Processor Architecture, Compiler, Simulator and Performance numbers.]
A SOC for Multimedia Applications
[Figure: SoC with Controller (ASIP), DSP (GP), VLIW processor (ASIP), on-chip memory, A/D and D/A, and glue logic.]
The application specific Controller performs master control of the system and memory access control.
The off-the-shelf (GP) DSP performs less computation intensive modem and sound codec functions.
The VLIW ASIP performs computation intensive functions: discrete cosine and inverse discrete cosine transforms, motion estimation, etc.
This is a typical application specific platform. Its structure has been adapted for a family of applications.
Besides GP processor cores, the platform also consists of ASIP cores which themselves are specialised.
Specialization of a VLIW ASIP (cont'd)
This is what an instruction word looks like:
op1 op2 op3 op4 op5 op6 op7 op8 op9 op10 op11
(the operation slots are grouped into Cluster 1, Cluster 2 and Cluster 3)
Specialization of a VLIW ASIP (cont'd)
Traditionally the datapath is organised as a single register file shared by all functional units.
Problem: Such a centralised structure does not scale!
We increase the number of functional units in order to increase parallelism.
We have to increase the number of registers in the register file.
Internal storage and communication between functional units and registers become dominant in terms of area, delay, and power.
High performance VLIW processors are limited not by arithmetic capacity but by internal bandwidth.
Specialization of a VLIW ASIP (cont'd)
A solution: clustering.
Restrict the connectivity between functional units and registers, so that each functional unit can read/write from/to a subset of registers.
Organise the datapath as clusters of functional units and local register files.
Nothing is for free!!!
Moving data between registers belonging to different clusters takes much time and power!
You have to drastically minimise the number of such moves by:
- Carefully adapting the structure of clusters to the application.
- Using very clever compilers.
Specialization of a VLIW ASIP (contd)
Instruction set specialisation: nothing special.
Functional unit and datapath specialisation:
- Determine the number of clusters.
- For each cluster determine
  - the number and type of functional units;
  - the dimension of the register file.
Memory specialisation is extremely important because we need to stream large amounts of data to the clusters at a high rate; one has to adapt the memory structure to the access characteristics of the application.
- Determine the number and size of memory banks.
Specialization of a VLIW ASIP (cont'd)
Interconnect specialisation:
- Determine the interconnect structure between clusters and from clusters to memory:
  - one or several buses,
  - crossbar interconnection,
  - etc.
Control specialisation:
- That is more or less done, as we have decided on a VLIW processor.
Tool Support for Processor Specialisation
Look at the design flow on slide 12!
In order to be able to generate a specialised architecture you need:
Retargetable compiler
Configurable simulator
Retargetable Compiler
[Figure: the retargetable compiler takes the algorithm and the processor architecture as inputs and produces object code.]
Retargetable Compiler (cont'd)
An automatically retargetable compiler can be used for a range of different target architectures.
The actual code optimization and code generation are done by the compiler, based on a description of the target processor architecture. This description is formulated in a so-called architecture description language.
Having a good compiler is important not only for the processor specialisation process!
Once you have your specialised ASIP, you need a good compiler in order to make efficient use of it!
Configurable Simulator
[Figure: the simulator takes object code and the processor architecture as inputs and produces performance numbers.]
Such a simulator can be configured for a particular architecture (based on an architecture description).
In this context, the most important output produced by the simulator is performance numbers:
- throughput
- delay
- power/energy consumption
Application Specific Platforms
Not only processors but also hardware platforms can be specialised for classes of applications.
The platform will define a certain communication infrastructure (buses and protocols), certain processor cores, peripherals, and accelerators commonly used in the particular application area, and a basic memory structure.
Application Specific Platforms (cont'd)
[Figure: example platform with three processor cores, a cache, DMA, memory, reconfigurable logic, peripherals, a system bus, a bridge, and a peripheral bus.]
Application Specific Platforms (cont'd)
Design space exploration for platform definition:
[Figure: the applications are mapped/compiled onto the platform architecture; the simulator produces performance numbers.]
Instantiating a Platform
Once we have an application, the chip to implement it on will not be designed as a collection of independently developed blocks, but will be an instance of an application specific platform.
The hardware platform will be refined by
- determining memory and cache size;
- identifying the particular cores and peripherals to be used;
- adding specific ASICs and accelerators;
- determining the amount of reconfigurable logic (if needed).
Instantiating a Platform (cont'd)
[Figure: a platform instance is derived from the platform architecture; the application is mapped/compiled onto it, and the simulator produces performance numbers.]
IP-Based Design (Design Reuse)
The key concept for increasing designers' productivity is reuse.
In order to manage the complexity of current large designs, we do not start from scratch but reuse as much as possible from previous designs, or use commercially available pre-designed IP blocks.
IP: intellectual property.
Some people call this IP-based design, core-based design, reuse techniques, etc.:
Core-based design is the process of composing a new system design by reusing existing components.
IP-Based Design (cont'd)
What are the blocks (cores) we reuse?
- Interfaces, encoders/decoders, filters, memories, timers, microcontroller cores, DSP cores, RISC cores, GP processor cores.
Possible(!) definition:
- A core is a design block which is larger than a typical RTL component.
Of course, we also reuse software components!
IP-Based Design (cont'd)
[Figure: Cores 1 to 3 and a processor (Core 4), taken from the libraries of Vendors A, B, and C, are attached through glue logic to an interconnection bus/switch, together with an I/O interface.]
What we have designed here can be:
- an application specific SoC;
- a platform to be further instantiated for a particular application.
Types of Cores
Hard cores: fully designed, placed, and routed by the supplier. A completely validated layout with definite timing.
- Advantage: rapid integration.
- Drawback: low flexibility.
Firm cores: technology-mapped gate-level netlists.
- Advantage: flexibility during place and route.
- Drawback: less predictability.
Types of Cores (cont'd)
Soft cores: synthesizable RTL or behavioral descriptions.
- Advantage: maximal flexibility. This flexibility can provide opportunities such as adding application-specific instructions to a processor core by modifying the behavioral description.
- Drawback: much work with integration and verification.
Reconfigurable Systems
Programmable Hardware Circuits:
They implement arbitrary combinational or sequential circuits and can be configured by loading a local memory that determines the interconnection among logic blocks.
Reconfiguration can be applied an unlimited number of times.
Main applications:
- Software acceleration
- Prototyping
Reconfigurable Systems (cont'd)
Dynamic reconfiguration: spatial and temporal partitioning
[Figure: a processor and memory coupled to an FPGA accelerator; the FPGA configurations at times t1, t2, t3, and t4 are temporally partitioned.]
Reconfigurable Systems (cont'd)
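System on Chip with dynamically reconfigurable datapath
[Figure: an SoC with CPU, on-chip memory, and reconfigurable datapath. Design flow: C code → profiling and kernel extraction → HW/SW partitioning; the kernels go to datapath synthesis, and the remaining C code is kept as software.]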
Summary
Architecture selection is about making trade-offs along the dimensions of speed, cost, flexibility, and power consumption.
ASIPs are programmable processors, specialised for a particular application or for a family of applications.
Specialisation of an ASIP concerns instruction set, functional units and datapath, memory system, interconnect, and control.
Two design tools are of great importance for processor specialisation: the retargetable compiler and the configurable simulator.
Not only processors can be specialised but also platforms. A platform is specialised to execute a certain family of applications. The particular hardware to be used for a given application is a specialised instantiation of the platform.
Summary (cont'd)
Reuse is a key technique for achieving high design productivity. Cores to be reused range from interfaces and decoders to filters and processors.
The three types of cores (hard, firm, and soft) differ in their flexibility, predictability, and the effort needed for integration.
Reconfigurable systems can provide good flexibility and, at the same time, many of the advantages of classical hardware implementation. They are mainly used for software acceleration and prototyping.
System-Level Power/Energy Optimization
1. Sources of Power Dissipation
2. Reducing Power Consumption
3. System Level Power Optimization
4. Dynamic Power Management
5. Mapping and Scheduling for Low Energy
6. Real-Time Scheduling with Dynamic Voltage Scaling
Remember the Design Flow
[Figure: design flow. Informal specification and constraints → modeling → system model (functional simulation, formal verification) → architecture selection → system architecture → mapping, scheduling, and estimation → mapped and scheduled model (simulation, formal verification; iterate while not OK) → software generation and hardware synthesis → software and hardware models → simulation.]
Why is Power Consumption an Issue?
Portable systems: battery lifetime!
Systems with a very limited power budget: Mars Pathfinder, autonomous helicopter, ...
Desktops and servers: high power consumption
- raises temperature and deteriorates performance and reliability;
- increases the need for expensive cooling mechanisms.
One of the main difficulties in developing high-performance chips is heat extraction.
High power consumption has economic and ecological consequences.
Sources of Power Dissipation in CMOS Devices
$$P = \underbrace{\tfrac{1}{2}\, C\, V_{DD}^2\, f\, N_{SW} + Q_{SC}\, V_{DD}\, f\, N_{SW}}_{\text{dynamic}} + \underbrace{I_{leak}\, V_{DD}}_{\text{static}}$$
- Switching power (first term): power required to charge/discharge circuit nodes.
- Short-circuit power (second term): dissipation due to short-circuit current.
- Leakage power (third term): dissipation due to leakage current.
C = node capacitances
NSW = switching activities (number of gate transitions per clock cycle)
f = frequency of operation
VDD = supply voltage
QSC = charge carried by short-circuit current per transition
Ileak = leakage current
Sources of Power Dissipation in CMOS Devices (cont'd)
[Figure: N-type CMOS transistor with gate, drain, source, and body terminals; Vbs is applied at the body.]
Vbs = body bias voltage
Vth = threshold voltage
Threshold voltage:
- The minimal voltage required at the gate to turn on the transistor.
Sources of Power Dissipation in CMOS Devices (cont'd)
[Figure: N-type CMOS transistor (gate, drain, source, body with Vbs) next to a CMOS inverter supplied by Vdd and driving the output load capacitance CL.]
Vbs = body bias voltage
Vth = threshold voltage
Vdd = supply voltage
CL = output load capacitance
Dynamic power:
- Charging and discharging the output load capacitance.
- Momentary short circuits at a gate's output.
Sources of Power Dissipation in CMOS Devices (cont'd)
[Figure: N-type CMOS transistor (gate, drain, source, body with Vbs) next to a CMOS inverter supplied by Vdd and driving the output load capacitance CL.]
Vbs = body bias voltage
Vth = threshold voltage
Vdd = supply voltage
CL = output load capacitance
Static power:
- Subthreshold leakage conduction: it flows even when the voltage at the gate is below Vth.
- Junction leakage (drain and source to body).
Sources of Power Dissipation in CMOS Devices (cont'd)
For a long time:
- Leakage power was considered negligible compared to dynamic power.
Today:
- Total dissipation from leakage is approaching the total from dynamic power.
As technology drops below 65 nm:
- Leakage power is exceeding dynamic power.
Sources of Power Dissipation in CMOS Devices (cont'd)
Leakage power is consumed even if the circuit is idle (standby). The only way to avoid it is to decouple the circuit from power.
Short-circuit power can be around 10% of the total.
Switching power is still the main source of power consumption.
For the rest of the discussion, we consider mainly switching power. At the end we come back to leakage.
Power and Energy Consumption
$$P = \tfrac{1}{2}\, C\, V_{DD}^2\, f\, N_{SW}$$
$$E = P \cdot t = \tfrac{1}{2}\, C\, V_{DD}^2\, N_{CY}\, N_{SW}$$
NCY = number of cycles needed for the particular task.
In certain situations we are concerned about power consumption:
- heat dissipation, cooling;
- physical deterioration due to temperature.
Sometimes we want to reduce the total energy consumed:
- battery life.
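The difference between the two formulas can be checked numerically. The following is a minimal sketch; the capacitance and switching-activity values are hypothetical and chosen only for illustration. Under the switching-power model above, lowering only the clock frequency reduces power but leaves the energy of a task unchanged, whereas lowering the supply voltage reduces both.

```python
def switching_power(c, vdd, f, n_sw):
    """Switching power: P = 1/2 * C * VDD^2 * f * N_SW (watts)."""
    return 0.5 * c * vdd ** 2 * f * n_sw


def switching_energy(c, vdd, n_cy, n_sw):
    """Energy of a task of N_CY cycles: E = 1/2 * C * VDD^2 * N_CY * N_SW (joules)."""
    return 0.5 * c * vdd ** 2 * n_cy * n_sw


# Hypothetical circuit parameters (not taken from the slides).
C = 1e-12        # node capacitance in farads
N_SW = 1000      # gate transitions per clock cycle
N_CY = 10 ** 9   # cycles needed by the task

for vdd, f in [(5.0, 50e6), (5.0, 25e6), (2.5, 25e6)]:
    p = switching_power(C, vdd, f, N_SW)
    e = switching_energy(C, vdd, N_CY, N_SW)
    print(f"VDD={vdd} V, f={f / 1e6:.0f} MHz: P={p:.4f} W, E={e:.4f} J")
```

The energy function does not take the frequency as a parameter at all, which is the point of the second formula: for a fixed number of cycles, only capacitance, voltage, and switching activity matter.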
Reducing Power/Energy Consumption
The main sources:
Reduce supply voltage
Reduce switching activity
Reduce capacitance
Reduce number of cycles
Reducing Power/Energy Consumption (cont'd)
Circuit level
- Ordering of transistors in a gate (influences capacitance).
- Transistor sizing.
Logic level
- Don't-care optimization to reduce switching activity.
- Reduce spurious switching activity by balancing the delays of paths that converge at each gate.
- Technology mapping.
- State encoding such that switching activity is minimised: if state s has a large number of transitions to state q, they should be given uni-distant codes (codes that differ in a single bit).
- Encoding to minimise switching activity in arithmetic units or on the bus.
- Gated clocks: gate the clocks of circuits (registers, gates, arithmetic units) when they are in idle time periods.
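The effect of encoding on switching activity can be illustrated with a small sketch. It assumes a hypothetical 4-bit state sequence that simply counts from 0 to 15, and compares plain binary codes with Gray codes, in which consecutive values are uni-distant. The function names are illustrative and not part of the slides.

```python
def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def bit_flips(codes):
    """Total number of bit transitions when the codes appear in sequence."""
    return sum(hamming(x, y) for x, y in zip(codes, codes[1:]))


def gray(n):
    """Gray code of n: consecutive integers get codes at Hamming distance 1."""
    return n ^ (n >> 1)


states = list(range(16))                      # hypothetical state sequence 0..15
binary_flips = bit_flips(states)              # plain binary encoding
gray_flips = bit_flips([gray(s) for s in states])
print(binary_flips, gray_flips)               # 26 15
```

For this sequence the Gray encoding needs 15 bit transitions instead of 26. For a real state machine the frequent transitions are not necessarily between consecutive numbers, so the encoding has to be chosen from the actual transition statistics.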
Reducing Power/Energy Consumption (cont'd)
Behavioral level
- Schedule and map operations so that the number of cycles is minimised (with an increased number of switchings per clock cycle) → you can run at a slower clock rate → you can reduce the supply voltage.
- Allocate and share modules so that power consumption is reduced (for example, by reducing switching activity).
Reducing Power/Energy Consumption (cont'd)
Architecture level
Specialise instruction set, datapath, and register structure to the particular architecture, with power consumption as an optimization goal.
- You have on the chip, and you switch, only those resources (gates) you really need.
Reduce power consumption on the bus:
- lower switching activity: clever encoding; reduce switching activity on the address bus by exploiting correlations;
- minimise the bus length (capacitance) by optimal module placement;
- bus segmentation: transform a long, heavily loaded global bus into a partitioned set of local bus segments.
Reducing Power/Energy Consumption (cont'd)
Optimise the memory structure.
- Memory transfers are extremely power hungry: a memory transfer takes 33 times more energy than an addition! Reducing the number of memory accesses is a very efficient way to save power!
- Adapt the number of caches, their size and associativity, and the length of the cache line to the application → reduce the number of memory transfers.
- Interesting trade-off: larger caches consume more power but reduce the number of memory transfers → find the right balance!
Reducing Power/Energy Consumption (cont'd)
Provide instruction support for power management:
- Instructions that put certain parts of the system in stand-by or shut them down.
- Instructions that dynamically set the supply voltage (dynamic voltage scaling).
Reducing Power/Energy Consumption (cont'd)
System Level
Static techniques are applied at design time.
- Compilation for low power: instruction selection considering their power profile, data placement in memory, register allocation.
- Algorithm design: find the algorithm which is the most power-efficient.
- Task mapping and scheduling.
Dynamic techniques are applied at run time.
- They reduce power consumption by exploiting idle or low-workload periods.
System Level Power Optimization
Three techniques will be discussed:
1. Dynamic power management: a dynamic technique.
2. Task mapping: a static technique.
3. Task scheduling with dynamic power scaling: static & dynamic.
Dynamic Power Management (DPM)
[Figure: three layers: application, power-aware OS, hardware.]
Decisions:
- Switching among multiple power states: idle, sleep, run.
- Switching among multiple frequencies and voltage levels.
Goal:
- Energy optimization.
- QoS constraints satisfied.
Dynamic Power Management (cont'd)
Hardware Support (e.g. Intel XScale Processor)
[Figure: power state machine.
- RUN operating points: 0.75 V, 60 mW, 150 MHz; 1.3 V, 450 mW, 600 MHz; 1.6 V, 900 mW, 800 MHz.
- IDLE: 40 mW.
- SLEEP: 160 µW.
Transitions between the states are annotated with latencies of 10 µs, 90 µs, 160 µs, 1.5 ms, and 140 ms.]
RUN: operational.
IDLE: clocks to the CPU are disabled; recovery is through interrupt.
SLEEP: mainly powered off; recovery through wake-up event.
Other intermediate states: DEEP IDLE, STANDBY, DEEP SLEEP.
Dynamic Power Management (cont'd)
DPM techniques are used in laptops, personal digital assistants (PDAs), and other portable appliances in order to shut down unused devices or place them in stand-by. The goal is power saving.
DPM techniques are implemented in the operating system (including Windows 2000 running on laptops).
The power breakdown for a laptop computer:
- 36% of total power consumed by the display
- 18% by the hard disk
- 18% by the wireless LAN interface
- 7% by keyboard, mouse, etc.
- 21% by digital VLSI circuits.
Don't forget these!
The Basic Concept of DPM
When there are requests for a device, the device is busy; otherwise it is idle.
When the device is idle, it can be shut down to enter a low-power sleeping state.
[Figure: timeline with instants T1 to T4. Workload: requests, then none, then requests again. Device state: busy, idle, busy. Power state: working, sleeping, working, with shut-down time Tsd and wake-up time Twu around the sleeping interval.]
Power Management Policies
Power management policies are concerned with predictions related to idle periods:
- For shut-down: try to predict how long the idle period will be, in order to decide if a shut-down should be performed.
- For wake-up: try to predict when the idle period ends, in order to avoid user delays due to Twu. This is quite difficult, and often the wake-up is simply started when a request has arrived.
Typical policies:
1. Time-out
2. Predictive
3. Stochastic
Predictive Policy
The length of an idle period is predicted. If the predicted idle period is long enough, the shut-down is performed immediately (no time interval T1 - T2 as on slide 16).
Policy:
- The plot of idle period against previous busy period is L-shaped: short busy periods are followed by long idle periods, while busy periods longer than a threshold are followed by short idle periods.
- Shut down after a short busy period!
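The predictive rule above can be written down in a few lines. This is a minimal sketch, not an implementation of any particular power manager: the threshold and the trace of busy periods are hypothetical values, and the function names are chosen for illustration.

```python
def predict_long_idle(prev_busy, busy_threshold):
    """L-shaped rule: a short busy period predicts a long idle period."""
    return prev_busy < busy_threshold


def shutdown_decisions(busy_periods, busy_threshold):
    """One decision per busy period: shut down right after it, or stay on."""
    return [predict_long_idle(b, busy_threshold) for b in busy_periods]


# Hypothetical trace of busy-period lengths (seconds) and threshold.
BUSY_THRESHOLD = 1.0
trace = [0.5, 4.0, 0.2, 6.5]

decisions = shutdown_decisions(trace, BUSY_THRESHOLD)
for busy, shut_down in zip(trace, decisions):
    action = "shut down immediately" if shut_down else "stay on"
    print(f"busy for {busy} s -> {action}")
```

In practice the threshold would have to be derived from observed workload data for the device in question; the sketch only shows where the decision is taken.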
Stochastic Policy
Predictions are based on Markov models: requests and power state transitions of the device are modelled as probabilistic state machines.
The power manager observes the arriving requests, the request queue, and the device, and generates shut-down commands.
[Figure: the environment or user generates requests (Markov model: request generator); the requests enter the request queue; the device provides service (Markov model: device). The power manager observes the request generator, the queue, and the device, and sends commands to the device.]
Mapping and Scheduling for Low Energy
For many embedded systems, DPM techniques like those presented before cannot be applied:
- They have no devices like a hard disk and no (or only a small) display → VLSI is a main source of power dissipation.
- They have time constraints → we have to keep deadlines (usually we cannot afford shut-down and wake-up times).
- The operating system is small → no sophisticated techniques at run time.
- The application is known at design time → we know a lot about the application already at design time.
Static techniques (applied at design time) can be used. Mapping and scheduling for low energy are important!
Mapping for Low Energy
[Figure: task graph with tasks 1 to 8, to be mapped onto processors p3 and p4, which are connected by a bus.]

        WCET         Energy
Task    p3    p4     p3    p4
1        5     6      5     3
2        7     9      8     4
3        5     6      5     3
4        8    10      6     4
5       10    11      8     6
6       17    21     15    10
7       10    14      8     7
8       15    19     14     9
Mapping for Low Energy (cont'd)
Consider a mapping:
- p3: 1, 3, 6, 7, 8.
- p4: 2, 4, 5.
Communication times and energy:
- C1-2: t = 1; E = 3.
- C3-5: t = 2; E = 5.
- C4-8: t = 1; E = 3.
- C5-7: t = 1; E = 3.
Execution time: 52; energy consumed: 75.
[Figure: Gantt chart of the schedule on p3 (tasks 1, 3, 6, 7, 8), p4 (tasks 2, 4, 5), and the bus (C1-2, C3-5, C4-8, C5-7); time axis from 0 to 64.]
Mapping for Low Energy (cont'd)
Consider a mapping:
- p3: 1, 3, 6, 7.
- p4: 2, 4, 5, 8.
Communication times and energy:
- C1-2: t = 1; E = 3.
- C3-5: t = 2; E = 5.
- C7-8: t = 1; E = 3.
- C5-7: t = 1; E = 3.
Execution time: 57; energy consumed: 70.
[Figure: Gantt chart of the schedule on p3 (tasks 1, 3, 6, 7), p4 (tasks 2, 4, 5, 8), and the bus (C1-2, C3-5, C5-7, C7-8); time axis from 0 to 64.]
Mapping for Low Energy (cont'd)
The second mapping, with task 8 on p4, consumes less energy.
Assume that we have a maximum allowed delay of 60.
The second mapping is then preferable, even though it is slower!
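The energy figures of the two mappings can be re-derived from the per-task energy table and the communication energies given on the slides. The sketch below does only that; the execution times (52 and 57) are copied from the slides rather than recomputed, because the full task graph is not available in the text. The function and variable names are illustrative.

```python
# Energy of each task on (p3, p4), copied from the table on the slides.
TASK_ENERGY = {
    1: (5, 3), 2: (8, 4), 3: (5, 3), 4: (6, 4),
    5: (8, 6), 6: (15, 10), 7: (8, 7), 8: (14, 9),
}
# Energy of each bus message, copied from the slides.
COMM_ENERGY = {"C1-2": 3, "C3-5": 5, "C4-8": 3, "C5-7": 3, "C7-8": 3}


def mapping_energy(on_p3, on_p4, messages):
    """Total energy = task energies on the chosen processors + bus messages."""
    tasks = sum(TASK_ENERGY[t][0] for t in on_p3) + sum(TASK_ENERGY[t][1] for t in on_p4)
    return tasks + sum(COMM_ENERGY[m] for m in messages)


def pick_mapping(candidates, max_delay):
    """Among the mappings that meet the delay bound, take the lowest energy."""
    feasible = [c for c in candidates if c["time"] <= max_delay]
    return min(feasible, key=lambda c: c["energy"])


first = {"name": "first", "time": 52,
         "energy": mapping_energy([1, 3, 6, 7, 8], [2, 4, 5],
                                  ["C1-2", "C3-5", "C4-8", "C5-7"])}
second = {"name": "second", "time": 57,
          "energy": mapping_energy([1, 3, 6, 7], [2, 4, 5, 8],
                                   ["C1-2", "C3-5", "C7-8", "C5-7"])}

best = pick_mapping([first, second], max_delay=60)
print(first["energy"], second["energy"], best["name"])  # 75 70 second
```

With a tighter bound, for example a maximum delay of 55, only the first mapping would remain feasible and the selection would change accordingly.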
Real-Time Scheduling with Dynamic Voltage Scaling (cont'd)
The scheduling problem:
Which task to execute at a certain moment on a certain processor so that time constraints are fulfilled?
The scheduling problem with voltage scaling:
Which task to execute at a certain moment on a certain processor, and at which voltage level, so that time constraints are fulfilled and energy consumption is minimised?
The problem: reducing the supply voltage extends execution time!
Variable Voltage Processors
Several supply voltage levels are available.
The supply voltage can be set by the application (operating system) through the execution of particular instructions.
The frequency is automatically adjusted to the current supply voltage.
Several processors with variable voltage levels are already available. There will be more and more in the near future.
The Basic Principle
We consider a single task:
- total computation: 10^9 execution cycles.
- deadline: 25 seconds.
- processor nominal (maximum) voltage: 5 V.
- energy: 40 nJ/cycle at nominal voltage.
- processor speed: 50 MHz (50×10^6 cycles/sec) at nominal voltage.
[Figure: V^2 versus time (0 to 25 sec). The task runs all 10^9 cycles at 5 V and finishes at texe = 20 sec, leaving 5 sec of slack; Etotal = 40 J.]
The Basic Principle (cont'd)
Let's make it slower!
VDD = 2.5 V:
- energy: 40×2.5^2/5^2 = 10 nJ/cycle.
- speed: 50×2.5/5 = 25 MHz.
[Figure: V^2 versus time (0 to 25 sec). 750×10^6 cycles run at 5 V, followed by 250×10^6 cycles at 2.5 V; texe = 25 sec, Etotal = 32.5 J.]
The Basic Principle (cont'd)
VDD = 4 V:
- energy: 40×4^2/5^2 = 25.6 ≈ 25 nJ/cycle.
- speed: 50×4/5 = 40 MHz.
[Figure: V^2 versus time (0 to 25 sec). All 10^9 cycles run at 4 V; texe = 25 sec, Etotal = 25 J.]
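The three scenarios can be reproduced with a short calculation. The sketch below uses the same simplified model as the slides: energy per cycle scales with the square of the supply voltage, and clock frequency scales linearly with it. The function names are illustrative. Note that the exact value at 4 V is 25.6 nJ/cycle, which the slides round to 25.

```python
V_NOM = 5.0           # nominal supply voltage (V)
E_NOM = 40e-9         # energy per cycle at nominal voltage (J)
F_NOM = 50e6          # clock frequency at nominal voltage (cycles/sec)


def energy_per_cycle(v):
    """Energy per cycle scales with V^2 (simplified model from the slides)."""
    return E_NOM * (v / V_NOM) ** 2


def frequency(v):
    """Clock frequency scales linearly with V (simplified model from the slides)."""
    return F_NOM * v / V_NOM


def run(segments):
    """segments: list of (cycles, voltage). Returns (time in s, energy in J)."""
    time = sum(cycles / frequency(v) for cycles, v in segments)
    energy = sum(cycles * energy_per_cycle(v) for cycles, v in segments)
    return time, energy


scenarios = {
    "5 V only": [(10 ** 9, 5.0)],
    "5 V then 2.5 V": [(750 * 10 ** 6, 5.0), (250 * 10 ** 6, 2.5)],
    "4 V only": [(10 ** 9, 4.0)],
}
for name, segments in scenarios.items():
    t, e = run(segments)
    print(f"{name}: t = {t:.2f} s, E = {e:.2f} J")
```

Both slowed-down schedules finish exactly at the 25-second deadline, but the single-voltage one consumes less energy than the one that mixes 5 V and 2.5 V.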
The Basic Principle (cont'd)
If a processor uses a single supply voltage and completes a program just on deadline, the energy consumption is minimised.
Consider two tasks, 1 and 2:
- Computation: task 1: 250×10^6 execution cycles; task 2: 750×10^6 execution cycles.
- Deadline: 25 seconds.
- Processor nominal (maximum) voltage: 5 V.
- Energy:
  - 40 nJ/cycle at nominal voltage.
  - 25 nJ/cycle at VDD = 4 V.
- Processor speed:
  - 50 MHz (50×10^6 cycles/sec) at nominal voltage.
  - 40 MHz at VDD = 4 V.
[Figure: task 1 followed by task 2.]
The Basic Principle (cont'd)
Find the voltage so that the tasks just meet their deadline → you have minimised energy consumption!
[Figure: V^2 versus time (0 to 25 sec). Task 1 (250×10^6 cycles) and task 2 (750×10^6 cycles) both run at 4 V and finish at 25 sec; Etotal = 25 J.]
Considering Task Particularities
Energy consumed by a task:
$$E = \tfrac{1}{2}\, C\, V_{DD}^2\, N_{CY}\, N_{SW}$$
Average energy consumed by the task per cycle:
$$E_{CY} = \tfrac{1}{2}\, C\, V_{DD}^2\, N_{SW}$$
NSW = number of gate transitions per clock cycle.
C = switched capacitance per clock cycle.
Often tasks differ from each other in terms of executed operations, so NSW and C differ from one task to the other.
The average energy consumed per cycle therefore differs from task to task.
Considering Task Particularities (cont'd)
Consider two tasks, 1 and 2:
- Computation: task 1: 250×10^6 execution cycles; task 2: 750×10^6 execution cycles.
- Deadline: 25 seconds.
- Processor nominal (maximum) voltage: 5 V.
- Processor speed and energy per cycle:

VDD      Speed     Energy task 1     Energy task 2
5 V      50 MHz    50 nJ/cycle       12.5 nJ/cycle
4 V      40 MHz    32 nJ/cycle       8 nJ/cycle
2.5 V    25 MHz    12.5 nJ/cycle     3 nJ/cycle

[Figure: task 1 followed by task 2.]
Considering Task Particularities (cont'd)
Here we have a solution with VDD = 4 V, and the deadline just fulfilled:
Etotal = 32 nJ/cycle × 250×10^6 cycles + 8 nJ/cycle × 750×10^6 cycles = 14 J
[Figure: V^2 versus time (0 to 25 sec). Task 1 (250×10^6 cycles) and task 2 (750×10^6 cycles) both run at 4 V and finish at 25 sec; Etotal = 14 J.]
Considering Task Particularities (cont'd)
Here we run task 1 at VDD = 2.5 V and task 2 at VDD = 5 V; the tasks finish just on deadline.
Etotal = 12.5 nJ/cycle × 250×10^6 cycles + 12.5 nJ/cycle × 750×10^6 cycles = 12.5 J
[Figure: V^2 versus time (0 to 25 sec). Task 1 (250×10^6 cycles) runs at 2.5 V, followed by task 2 (750×10^6 cycles) at 5 V; Etotal = 12.5 J.]
Considering Task Particularities (cont'd)
If power consumption per cycle is not constant (but differs from task to task), the rule on slide 33 is not true any more.
Voltage levels have to be reduced with priority for those tasks which have a larger energy consumption per cycle.
One particular voltage level has to be established for each task, so that deadlines are just satisfied.
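For the two-task example, the best per-task voltage assignment can be found by simple enumeration. The sketch below is an exhaustive search over the three voltage levels quoted on the slides (it is not the optimization method of the lecture, only a check of the numbers); speeds and energy-per-cycle values are copied from the slides, and the names are illustrative.

```python
from itertools import product

DEADLINE = 25.0  # seconds
SPEED = {5.0: 50e6, 4.0: 40e6, 2.5: 25e6}  # cycles/sec at each voltage
TASKS = [
    # Task 1: 250e6 cycles, energy per cycle (J) at each voltage.
    {"cycles": 250e6, "energy": {5.0: 50e-9, 4.0: 32e-9, 2.5: 12.5e-9}},
    # Task 2: 750e6 cycles.
    {"cycles": 750e6, "energy": {5.0: 12.5e-9, 4.0: 8e-9, 2.5: 3e-9}},
]


def evaluate(voltages):
    """Return (total time, total energy) when task i runs at voltages[i]."""
    time = sum(t["cycles"] / SPEED[v] for t, v in zip(TASKS, voltages))
    energy = sum(t["cycles"] * t["energy"][v] for t, v in zip(TASKS, voltages))
    return time, energy


def best_assignment():
    """Try every combination of voltage levels; keep the cheapest feasible one."""
    best = None
    for voltages in product(SPEED, repeat=len(TASKS)):
        time, energy = evaluate(voltages)
        if time <= DEADLINE + 1e-9 and (best is None or energy < best[2]):
            best = (voltages, time, energy)
    return best


voltages, time, energy = best_assignment()
print(voltages, time, round(energy, 3))  # (2.5, 5.0) 25.0 12.5
```

The search confirms the slides: running the energy-hungry task 1 at 2.5 V and task 2 at 5 V (12.5 J) beats the single-voltage solution at 4 V (14 J).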
Discrete Voltage Levels
Practical microprocessors can work only at a finite number of discrete voltage levels.
- The ideal voltage Videal, determined for a certain task, may not exist among the available levels.
- A task is supposed to run for time texe at the voltage Videal.
- On the particular processor the two closest available neighbours to Videal are V1 < Videal < V2.
- You have minimised the energy if you run the task for time t1 at voltage V1 and for time t2 at voltage V2, so that t1 + t2 = texe.
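The split of texe into t1 and t2 follows from requiring that the same number of cycles is executed. The sketch below assumes, as in the earlier examples, that clock frequency is proportional to the supply voltage; under that assumption the cycle count at Videal for texe equals the cycles at V1 for t1 plus the cycles at V2 for t2. The function name is illustrative.

```python
def split_time(v_ideal, v1, v2, t_exe):
    """Return (t1, t2) with t1 + t2 = t_exe and the same number of cycles
    as running at v_ideal for t_exe, assuming frequency proportional to voltage:
        v1 * t1 + v2 * t2 = v_ideal * t_exe
    """
    if not v1 < v_ideal < v2:
        raise ValueError("v_ideal must lie strictly between v1 and v2")
    t2 = t_exe * (v_ideal - v1) / (v2 - v1)
    return t_exe - t2, t2


# Example with the numbers used earlier: ideal voltage 4 V for 25 s,
# but only 2.5 V and 5 V are available.
t1, t2 = split_time(4.0, 2.5, 5.0, 25.0)
print(t1, t2)  # 10.0 15.0
```

This is the 5 V / 2.5 V schedule from the single-task example (15 sec at 5 V, 10 sec at 2.5 V), which costs 32.5 J compared with 25.6 J at the ideal 4 V: discrete levels meet the deadline but cannot reach the energy of the ideal voltage.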
The Pitfalls with Ignoring Leakage
$$E = N_C \cdot C_{eff} \cdot V_{dd}^2 + L_g \left( V_{dd} \cdot K_3 \cdot e^{K_4 V_{dd}} \cdot e^{K_5 V_{bs}} + |V_{bs}| \cdot I_{ju} \right) \cdot t$$
Minimise this (the first, dynamic term) and ignore the rest!
The Pitfalls with Ignoring Leakage
$$E = N_C \cdot C_{eff} \cdot V_{dd}^2 + L_g \left( V_{dd} \cdot K_3 \cdot e^{K_4 V_{dd}} \cdot e^{K_5 V_{bs}} + |V_{bs}| \cdot I_{ju} \right) \cdot t$$
1. We don't optimize global energy but only a part of it!
2. We can even get it very wrong and increase energy consumption!
Leakage decreases with Vdd, but grows with time!
Dynamic energy decreases with Vdd regardless of increased time.
$$E = N_C \cdot C_{eff} \cdot V_{dd}^2 + L_g \left( V_{dd} \cdot K_3 \cdot e^{K_4 V_{dd}} \cdot e^{K_5 V_{bs}} + |V_{bs}| \cdot I_{ju} \right) \cdot t$$
[Figure: energy per cycle (axis ticks 1e-10 to 8e-10) versus Vdd (0.5 to 1), showing the dynamic energy curve. Jejurikar et al., DAC'04; 70nm technology, Crusoe processor.]
$$E = N_C \cdot C_{eff} \cdot V_{dd}^2 + L_g \left( V_{dd} \cdot K_3 \cdot e^{K_4 V_{dd}} \cdot e^{K_5 V_{bs}} + |V_{bs}| \cdot I_{ju} \right) \cdot t$$
[Figure: energy per cycle (axis ticks 1e-10 to 8e-10) versus Vdd (0.5 to 1), showing the leakage energy and dynamic energy curves. Jejurikar et al., DAC'04; 70nm technology, Crusoe processor.]
$$E = N_C \cdot C_{eff} \cdot V_{dd}^2 + L_g \left( V_{dd} \cdot K_3 \cdot e^{K_4 V_{dd}} \cdot e^{K_5 V_{bs}} + |V_{bs}| \cdot I_{ju} \right) \cdot t$$
[Figure: energy per cycle (axis ticks 1e-10 to 8e-10) versus Vdd (0.5 to 1), showing the leakage energy, dynamic energy, and dynamic + leakage curves. Jejurikar et al., DAC'04; 70nm technology.]
Critical point! If you go beyond this with Vdd, energy grows.
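The critical point can be illustrated numerically with a minimal sketch that evaluates the two terms of the energy formula per cycle (N_C = 1, t = 1/f) over a range of Vdd values. All constant values below, the body-bias setting, and the frequency model `frequency(vdd)` are assumptions introduced for illustration; none of them are given on the slides, so the location of the minimum should not be read as the value shown in the plot.

```python
import math

# Illustrative constants (assumptions, not taken from the slides).
C_EFF = 0.43e-9                  # effective switched capacitance [F]
L_G = 4.0e6                      # number of gates
K3, K4, K5 = 5.38e-7, 1.83, 4.19
I_JU = 4.8e-10                   # junction leakage current [A]
V_BS = -0.7                      # body-bias voltage [V]

# Assumed frequency model: f = (Vdd - Vth)^alpha / (L_d * K6).
K1, K2, K6 = 0.063, 0.153, 5.26e-12
V_TH1, ALPHA, L_D = 0.244, 1.5, 37


def frequency(vdd):
    """Clock frequency at supply voltage vdd under the assumed model."""
    vth = V_TH1 - K1 * vdd - K2 * V_BS
    return (vdd - vth) ** ALPHA / (L_D * K6)


def dynamic_energy_per_cycle(vdd):
    """First term of the formula with N_C = 1."""
    return C_EFF * vdd ** 2


def leakage_energy_per_cycle(vdd):
    """Second term of the formula with t equal to one cycle time."""
    p_leak = L_G * (vdd * K3 * math.exp(K4 * vdd) * math.exp(K5 * V_BS)
                    + abs(V_BS) * I_JU)
    return p_leak / frequency(vdd)


def total_energy_per_cycle(vdd):
    return dynamic_energy_per_cycle(vdd) + leakage_energy_per_cycle(vdd)


def critical_vdd(v_min=0.5, v_max=1.0, steps=500):
    """Grid search for the Vdd that minimises total energy per cycle."""
    candidates = [v_min + i * (v_max - v_min) / steps for i in range(steps + 1)]
    return min(candidates, key=total_energy_per_cycle)


v_crit = critical_vdd()
print(f"critical Vdd is about {v_crit:.2f} V under these assumptions")
```

Under these assumed constants the dynamic term keeps falling as Vdd is lowered, while the leakage term per cycle rises because each cycle takes longer, so the sum has a minimum inside the voltage range rather than at its lower end.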
Summary
Power consumption becomes a central issue for embedded systems design.
Power/energy consumption can be reduced by reducing supply voltage, switching activity, switched capacitance, and the number of executed cycles.
There are means at all levels of the design to reduce power consumption: circuit, logic, behavioral, architecture, system level.
At system level we distinguish dynamic techniques (applied during run-time) and static techniques (applied at design time).
Summary (cont'd)
Dynamic power management is implemented by the operating system, and is mainly used in portable appliances to shut down unused devices or place them in stand-by.
Typical policies for power management are: time-out, predictive, and stochastic.
Both at task mapping and at scheduling, design decisions can be made which have a huge impact on power/energy consumption.
Real-time scheduling in the context of processors with voltage scaling is extremely interesting. The main trade-off is voltage level vs. execution time. One has to find the optimal voltage levels such that energy consumption is reduced and deadlines are still fulfilled.