embedded system architecture by Ralf Niemann

  • 8/10/2019 embedded system architecture by Ralf Niemann

    1/130

    Hardware/Software Codesign

    of Embedded Systems

    Petru Eles and Zebo Peng

    Embedded Systems Laboratory (ESLAB), Linköping University

    Embedded Tutorial

    Lecture Contents

    = Introduction and basic issues.

    = Architectures and platforms.

    = Analysis, co-simulation, and design space exploration.


    Prof. Z. Peng, ESLAB/LiTH

    Introduction

    = Codesign of embedded systems

    = Definition and motivation

    = The design flows

    = System level design issues

    Traditional Design Flow

    Informal System Specification

    Early, Manual Partitioning

    HW Specification

    SW Specification


    Design Time

    [Figure: design-time comparison along a time axis.]

    Traditional Design: Specification & Partitioning; HW Design & Simulation; SW Design & Simulation; Integration & Test.

    HW/SW Codesign: Specification & Partitioning; HW Design & Simulation; SW Design & Simulation; Co-simulation & Co-verification; Integration & Test. The result is a reduced time to market (TTM).

    HW/SW Codesign

    = The concurrent design of hardware and software elements, supporting explicit hardware/software trade-off.

    0 Co-specification to create a common specification that describes both hardware and software elements


    Why Codesign?

    = Reduce time-to-market.

    = Achieve better designs:

    0 More design alternatives can be explored.

    0 Better solutions can be found by advanced optimization techniques.

    = To meet strict design constraints, such as:

    0 Timing or performance constraints.

    0 Power dissipation.

    0 Physical constraints, e.g., size, weight, etc.

    0 Safety and reliability constraints.

    0 Cost constraints.

    = Codesign is also made possible by the advances in design methodologies and tools.

    Vertical Codesign

    = Instruction set processor design, for both general-purpose systems and ASIPs (Application Specific Instruction-set Processors).

    To determine how big the

    hardware engine you need to

    Specification


    Codesign of Processors

    = General-Purpose Processors

    0 Architectural support for operating systems.

    0 Cache design and tuning (e.g., selection of cache size and control schemes).

    0 Pipeline control design (control mechanisms, compiler design).

    = ASIPs

    0 Customization of instruction sets and specific resources (e.g., accelerator and coprocessor).

    0 Design of register files, buses and interconnections.

    0 Development of specific compiler.

    Horizontal Codesign

    = Some of the system functionality is implemented in software running on programmable CPUs, while other functions are implemented in hardware.

    = Typical for design of embedded systems.


    What is an Embedded System?

    = There are many different definitions!

    0 A special-purpose computer system that is used for a particular task.

    0 A computer-based system embedded in real-life machines. Though computer based, it does not have the usual keyboard and monitor. The processor and related circuitry are configured to do a specific task.

    = Some highlight what it is (not) used for:

    0 Any device which includes a programmable component but itself is not intended to be a general purpose computer.

    = Some focus on what it is built from:

    0 A collection of programmable parts surrounded by ASICs and other standard components, that interact continuously with an environment through sensors and actuators.

    Characteristics of an Embedded System

    = Dedicated (not general purpose).

    0 One or several applications known at design-time.

    = Contains a programmable component.

    0 But usually not programmable by the end-user.

    = Interacts (continuously) with the environment:

    0 Real-time behavior.


    Embedded Systems

    [Figure: microprocessor market shares in 1999: general purpose systems 1%, embedded systems 99%.]

    Embedded Controllers

    [Figure: embedded controller with CPU and memory, connected to sensors and actuators.]


    [Figure: embedded controller consisting of CPU, RAM, ROM, ASIC, I/O interface, and network interface, connected to actuators and sensors.]

    Distributed Embedded Systems

    [Figure: two groups of three ECUs each, connected through gateways.]

    Time and Power Constraints

    = Time constraints:

    0 They have to perform in real-time: if data are not ready by a certain deadline, the system fails to perform correctly.

    0 Hard deadline: failure to meet it leads to major hazards.

    0 Soft deadline: failure to meet it can be tolerated, but quality of service is reduced.

    = Power constraints


    Safety Critical Requirements

    = Embedded systems are often used in life-critical applications.

    0 Avionics, automotive electronics, nuclear plants, medical applications, military applications, etc.

    = Reliability and safety are major requirements.

    = To guarantee correctness during design:

    0 Formal verification: Mathematics-based methods to verify certain properties of the designed system.

    0 Automatic synthesis: Certain design steps are automatically performed by design tools (correctness by construction).

    Short Time to Market

    = In highly competitive markets it is critical to catch the market window:

    0 A short delay with the product on the market can have catastrophic financial consequences (even if the quality of the product is excellent).

    = Design time has to be reduced!


    The ES Design Challenges

    = Increasing application complexity (e.g., automotive).

    = Heterogeneous architecture (HW, SW, network, mechatronics, etc.).

    = Stringent time and power constraints.

    = Low cost requirement.

    = Short time to market.

    = Safety and reliability (e.g., very long life-time).

    = In order to achieve all these requirements, systems have to be highly optimized.

    = Both hardware and software aspects have to be considered simultaneously!

    Current Design Practice

    1. Start from some informal specification and a set of constraints (time, power, and cost constraints).

    2. Generate a more formal specification, based on some modeling concept (FSM, data-flow, etc.), using Matlab, Statecharts, SystemC, C, UML, or VHDL.

    3. Simulate the model in order to check its functionality. The model is modified, if needed.


    The Consequences

    = Delays in the design process:

    0 Increased design cost.

    0 Delays in time to market -> missed market window.

    = High cost due to many iterations with implementation and prototyping.

    = Bad design decisions taken under time pressure:

    0 Low quality.

    0 High cost.

    = The lesson: We need to explore more design alternatives in an efficient manner.

    0 At the system level!

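    System-Level Design

    [Figure: system-level design flow. Informal specification and constraints are turned by modeling into a system model, which is checked by functional simulation and formal verification; architecture selection yields the system architecture, followed by mapping.]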


    The Improved Design Flow

    = Several design alternatives are evaluated before going down to the lower-level design.

    0 This is performed as part of the design space exploration process.

    0 Different architectures, mappings and schedules are explored, before the actual implementation and prototyping.

    = We get highly optimized solutions in short time.

    0 There is a good chance that design iterations at the lower-level, including prototyping, can be avoided.

    Additional Improvements

    = Formal verification

    0 It is impossible to do an exhaustive simulation.

    0 Especially for safety critical systems, formal verification is needed.

    = Simulation

    0 Used not only for functional validation.

    0 Should also be used after mapping and scheduling in order to check, for example, timing properties.


    The Lower-Level Issues

    = Software generation:

    0 Encoding in an implementation language (C, C++, assembler).

    0 Compiling (this can include particular optimizations for application specific processors, DSPs, etc.).

    0 Generation of a real-time kernel or adapting to an existing operating system.

    = Hardware synthesis:

    0 Encoding in a HDL (VHDL and Verilog).

    0 Successive synthesis steps: high-level, register-transfer level, logic-level synthesis.

    = Hardware/software integration:

    0 The software is run together with the hardware model (co-simulation).

    = Prototyping:

    0 A prototype of the hardware is constructed and the software is executed on the target architecture.

    Lower-Level Design

    There are established CAD tools on the market which automatically perform many of the low level tasks:

    = Code generators (software model -> C, hardware model -> VHDL)

    = Compilers.

    = Hardware synthesis tools.


    Focus on System-Level Design

    = Has a huge influence on the quality of the final implementation.

    = Very few commercial tools are available.

    = Mostly experimental and academic tools available.

    = Huge efforts and investments are currently made in order to develop tools and methodologies for system level design.

    = Ad-hoc solutions are less and less acceptable.

    = It is the system level we are mainly interested in, in this course!

    Concluding Remarks

    = Codesign provides the capability to make explicit and efficient hardware/software trade-off.

    = Codesign of embedded systems has many advantages and challenges.


    Analysis, Co-Simulation

    and Design Space Exploration

    Zebo Peng

    Embedded Systems Laboratory (ESLAB), Linköping University

    Outline

    = Static analysis techniques

    = Design space exploration


    The Design Space

    = Very large due to many solution parameters:

    0 architectures and components

    0 hardware/software partitioning

    0 mapping and scheduling

    0 operating systems and global control

    0 communication synthesis

    [Figure: an SoC combining hardware and software: microprocessor, DSP, ASIC, embedded memory, analog circuit, high-speed electronics, sensor, and network. Sources: S3; Stratus Computers.]

    Design Space Exploration

    What is needed in order to explore the complex design space and find a good solution:

    = Exploration at the higher levels of abstraction.

    = Development of high-level analysis and estimation techniques.

    = Employment of very fast exploration algorithms


    The Optimization Problem

    The majority of design space exploration tasks can be viewed as optimization problems:

    To find

    0 the architecture (type and number of processors, memory modules, and communication blocks, as well as their interconnections),

    0 the mapping of functionality onto the architecture components, and

    0 the schedules of basic functions and communications,

    such that a cost function (in terms of implementation cost, performance, power, etc.) is minimized and a set of constraints is satisfied.

    The System Partitioning Problem

    [Figure: two-way partitioning of a graph with weighted nodes and edges.]


    Hardware/Software Partitioning

    Input: Implementation independent system specification consisting of interacting processes (e.g., VHDL).

    Output: Two sets of processes, assigned for hardware and software implementation respectively.

    Target architecture:

    - Microprocessors

    - ASICs

    - Shared memories

    Hardware/Software Partitioning

    Assumptions:

    = Microprocessor and ASIC working in parallel;

    = Reducing the amount of communication between the microprocessor and hardware improves the overall performance.

    Objectives:


    Hardware/Software Partitioning

    = Quantitative values can be derived via simulation, profiling, or static analysis of the specification.

    Ex.

    0 computation load (CL): number of operations executed by a basic region or process of the specification.

    0 communication intensity (CI): total number of communication operations on a channel between two processes.

    = Performance improvement based on:

    0 Placing computation intensive processes into hardware.

    0 Increasing parallelism.

    0 Reducing inter-domain communication.

    Process Graph Formulation

    = nodes correspond to processes, which could be processes or basic blocks in the original specification (e.g., VHDL).

    = node weights reflect the degree of suitability for hardware implementation of the corresponding process:

    the computation load of the process;


    Process Graph Formulation

    = The Graph Partitioning Problem: To partition the process graph into two groups such that the sum of the weights of the cut edges will be minimal, subject to a set of constraints:

    Ex.

    $$\sum_{i \in H} H\_cost_i \le Max^{H}$$ (physical limitation of silicon area)

    $$W1^{N}_{i} \ge Lim \Rightarrow i \in Hw$$ (implement a node in HW, when it is appropriate)
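The objective and the area constraint above can be sketched in a few lines. This is a minimal illustration, not the authors' tool: the process names, edge weights, hardware costs, and the limit of 80 are hypothetical values chosen for the example.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]

def cut_weight(edges: Dict[Edge, int], hw: Set[str]) -> int:
    """Sum the weights of edges with exactly one endpoint in the hardware set."""
    return sum(w for (u, v), w in edges.items() if (u in hw) != (v in hw))

def is_feasible(hw: Set[str], hw_cost: Dict[str, int], max_hw_cost: int) -> bool:
    """Area-style constraint: total hardware cost must stay within the limit."""
    return sum(hw_cost[n] for n in hw) <= max_hw_cost

# Hypothetical process graph; edge weights stand for communication intensity.
edges = {("p1", "p2"): 5, ("p2", "p3"): 2, ("p3", "p4"): 8, ("p1", "p4"): 3}
hw_cost = {"p1": 40, "p2": 30, "p3": 50, "p4": 20}

hw = {"p3", "p4"}                      # candidate hardware partition
print(cut_weight(edges, hw))           # edges p2-p3 and p1-p4 are cut
print(is_feasible(hw, hw_cost, 80))
```

A partitioning heuristic would call `cut_weight` as its cost function and reject candidates for which `is_feasible` is false.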

    Features of CO (Combinatorial Optimization) Problems

    = Most CO problems, e.g., system partitioning with constraints, for digital system designs are NP-complete.

    = With all known exact algorithms, the time needed to solve an NP-complete problem grows exponentially with respect to the problem size n.


    Features of CO Problems

    = Many CO problems can be formulated as an Integer Linear Programming (ILP) problem, and solved by an ILP solver.

    = It is inherently more difficult to solve an ILP problem than the corresponding Linear Programming problem.

    = The size of problem that can be solved successfully by ILP algorithms is an order of magnitude smaller than the size of LP problems that can be easily solved.

    Heuristics

    = A heuristic seeks near-optimal solutions at a reasonable computational cost without being able to guarantee either optimality or feasibility.

    = Motivations:

    0 Many exact algorithms involve a huge amount of computation effort.

    0 The decision variables have frequently complicated


    Heuristic Approaches to CO

                                              Problem specific           Generic methods

    Constructive                              Clustering                 Branch and bound
                                              List scheduling            Divide and conquer
                                              Left-edge algorithm

    Transformational (iterative improvement)  Kernighan-Lin algorithm    Neighborhood search
                                                                         Simulated annealing
                                                                         Tabu search
                                                                         Genetic algorithms
                                                                         (meta-heuristics)

    Clustering for System Partitioning

    = Each node initially belongs to its own cluster, and clusters are then gradually merged until the desired partitioning is found.

    = The merge operation is selected based on local information (closeness metrics), rather than global view of the whole system.
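A minimal sketch of this bottom-up merging follows. The closeness values, the node names, and the choice to rate a pair of clusters by their closest pair of members are assumptions made for illustration; the slides do not fix a particular closeness metric.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List

def cluster(nodes: Iterable[str],
            closeness: Dict[FrozenSet[str], float],
            target_clusters: int) -> List[FrozenSet[str]]:
    """Merge the two closest clusters until target_clusters remain."""
    clusters: List[FrozenSet[str]] = [frozenset([n]) for n in nodes]

    def link(a: FrozenSet[str], b: FrozenSet[str]) -> float:
        # Closeness of two clusters = closeness of their closest members
        # (an assumed metric; missing pairs count as 0).
        return max(closeness.get(frozenset((u, v)), 0.0) for u in a for v in b)

    while len(clusters) > target_clusters:
        a, b = max(combinations(clusters, 2), key=lambda pair: link(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

# Hypothetical closeness values between processes.
closeness = {
    frozenset(("v1", "v2")): 9.0,
    frozenset(("v3", "v4")): 8.0,
    frozenset(("v2", "v3")): 2.0,
}
result = cluster(["v1", "v2", "v3", "v4"], closeness, 2)
print(result)
```

Each merge decision looks only at pairwise closeness, which is the "local information" the slide refers to; no global cost function is evaluated.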



    The Kernighan-Lin Algorithm (KL)

    = A graph is partitioned into two clusters of arbitrary size, by minimizing a given objective function.

    = KL is based on an iterative partitioning strategy:

    0 The algorithm starts with two arbitrary clusters C1 and C2.

    0 The partitioning is then iteratively improved by moving nodes between the clusters.

    0 At each iteration, the node which produces the minimal value of the cost function is moved; this value can, however, be greater than the value before moving the node.

    Branch-and-Bound

    = Traverse an implicit tree to find the best leaf (solution).

    [Figure: 4-city TSP example with cities 0 to 3 and their distance matrix.]


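    Branch-and-Bound Ex.

    Distance matrix of the 4-city TSP (symmetric, upper triangle shown):

           0    1    2    3
      0    0    3    6   41
      1         0   40    5
      2              0    4
      3                   0

    Search tree (partial tours carry a lower bound on L, complete tours the exact tour length L):

    {0}  L >= 0
      {0,1}  L >= 3
        {0,1,2}  L >= 43
          {0,1,2,3}  L = 88
        {0,1,3}  L >= 8
          {0,1,3,2}  L = 18
      {0,2}  L >= 6
        {0,2,1}  L >= 46
          {0,2,1,3}  L = 92
        {0,2,3}  L >= 10
          {0,2,3,1}  L = 18
      {0,3}  L >= 41
        {0,3,1}  L >= 46
          {0,3,1,2}  L = 92
        {0,3,2}  L >= 45
          {0,3,2,1}  L = 88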

    = Lower bound on the cost function.

    = Search strategy
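The two ingredients named above, a lower bound and a search strategy, can be sketched for the 4-city TSP example. This is an illustrative sketch: the distance matrix is read off the example (treat the entries as an assumption where the extraction is unclear), the bound is simply the length of the partial path, and the strategy is depth-first.

```python
import math
from typing import List, Tuple

# Distance matrix of the 4-city example (symmetric).
D = [
    [0, 3, 6, 41],
    [3, 0, 40, 5],
    [6, 40, 0, 4],
    [41, 5, 4, 0],
]

def branch_and_bound_tsp(dist: List[List[int]]) -> Tuple[int, List[int]]:
    """Depth-first branch-and-bound; a partial path is pruned when its
    length (the lower bound) is not below the best complete tour so far."""
    n = len(dist)
    best_len = math.inf
    best_tour: List[int] = []

    def visit(path: List[int], length: int) -> None:
        nonlocal best_len, best_tour
        if length >= best_len:
            return  # bound: this subtree cannot improve on the best tour
        if len(path) == n:
            total = length + dist[path[-1]][path[0]]  # close the tour
            if total < best_len:
                best_len, best_tour = total, path[:]
            return
        for city in range(n):  # branch: extend the path by one unvisited city
            if city not in path:
                visit(path + [city], length + dist[path[-1]][city])

    visit([0], 0)
    return best_len, best_tour

print(branch_and_bound_tsp(D))  # (18, [0, 1, 3, 2])
```

With this matrix the subtrees under {0,2,1} and {0,3} are never expanded, because their bounds (46 and 41) already exceed the tour of length 18 found earlier. A tighter bound or a best-first strategy would prune differently.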

    Neighborhood Search Method

    = Step 1 (Initialization)

    (A) Select a starting solution x_now ∈ X.

    (B) x_best = x_now, best_cost = c(x_best).

    = Step 2 (Choice and termination)

    Choose a solution x_next ∈ N(x_now). If no solution can be selected or the terminating criteria apply, then the method stops.


    Neighborhood Search Method

    = The neighborhood search method is very attractive for many CO problems as they have a natural neighborhood structure, which can be easily defined and evaluated.

    0 Ex. Graph partitioning: swapping two nodes.

    [Figure: a two-way partitioned weighted graph before and after swapping two nodes.]

    The Descent Method

    = Step 1 (Initialization)

    = Step 2 (Choice and termination)

    Choose x_next ∈ N(x_now) such that c(x_next) < c(x_now), and terminate if no such x_next can be found.

    = Step 3 (Update)

    The descent process can easily be stuck at a local optimum.
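The descent steps can be sketched for two-way graph partitioning with the swap neighborhood. The graph, its weights, and the first-improvement choice rule are assumptions made for this illustration.

```python
from itertools import product
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]

def cut_cost(edges: Dict[Edge, int], left: Set[str]) -> int:
    """Cost c(x): total weight of edges crossing the partition."""
    return sum(w for (u, v), w in edges.items() if (u in left) != (v in left))

def descent(edges: Dict[Edge, int], left: Set[str], right: Set[str]):
    """Take improving swaps until no neighbor is better (a local optimum)."""
    left, right = set(left), set(right)
    cost = cut_cost(edges, left)
    improved = True
    while improved:
        improved = False
        # Neighborhood N(x_now): all partitions reachable by swapping two nodes.
        for a, b in product(sorted(left), sorted(right)):
            candidate = (left - {a}) | {b}
            candidate_cost = cut_cost(edges, candidate)
            if candidate_cost < cost:  # accept only strictly better neighbors
                left, right = candidate, (right - {b}) | {a}
                cost = candidate_cost
                improved = True
                break
    return left, right, cost

# Hypothetical graph: a-b and c-d communicate heavily.
edges = {("a", "b"): 10, ("c", "d"): 10, ("a", "c"): 1, ("b", "d"): 1}
final_left, final_right, final_cost = descent(edges, {"a", "c"}, {"b", "d"})
print(sorted(final_left), sorted(final_right), final_cost)
```

Because only strictly improving moves are accepted, the result depends on the starting partition, which is why the method can stop at a solution that is locally but not globally optimal.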


    Dealing with Local Optimality

    = Enlarge the neighborhood.

    = Start with different initial solutions.

    = To allow uphill moves:

    0 Simulated annealing

    0 Tabu search

    [Figure: cost plotted over the solution space X.]

    The SA Algorithm

    Select an initial solution x_now ∈ X;

    Select an initial temperature t > 0;

    Select a temperature reduction function α;

    Repeat

      Repeat

        Randomly select x_next ∈ N(x_now);

        δ = cost(x_next) - cost(x_now);
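Since the listing above breaks off, here is a hedged sketch of how such a loop is commonly completed. The acceptance rule exp(-δ/t), the geometric cooling t := alpha * t, the stopping threshold `t_min`, and all parameter defaults are standard textbook choices assumed here, not taken from the slides.

```python
import math
import random
from typing import Callable, TypeVar

S = TypeVar("S")

def simulated_annealing(
    x0: S,
    cost: Callable[[S], float],
    random_neighbor: Callable[[S, random.Random], S],
    t0: float = 10.0,                 # initial temperature (assumed value)
    alpha: float = 0.9,               # geometric temperature reduction (assumed)
    moves_per_temperature: int = 50,  # inner-loop length (assumed)
    t_min: float = 1e-3,              # stopping temperature (assumed)
    seed: int = 0,
) -> S:
    rng = random.Random(seed)
    x_now, x_best = x0, x0
    t = t0
    while t > t_min:                              # outer Repeat
        for _ in range(moves_per_temperature):    # inner Repeat
            x_next = random_neighbor(x_now, rng)
            delta = cost(x_next) - cost(x_now)
            # Downhill moves are always taken; uphill moves with prob. exp(-delta/t).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                x_now = x_next
                if cost(x_now) < cost(x_best):
                    x_best = x_now
        t *= alpha
    return x_best

# Toy usage: minimize (x - 7)^2 over the integers with +/-1 moves.
best = simulated_annealing(
    0,
    cost=lambda x: (x - 7) ** 2,
    random_neighbor=lambda x, rng: x + rng.choice((-1, 1)),
)
print(best)
```

At high temperature most uphill moves are accepted, which lets the search leave local optima; as t shrinks the loop behaves more and more like the descent method.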


    A HW/SW Partitioning Example

    [Figure: cost function value (y-axis, 35000 to 75000) versus number of iterations (x-axis, 0 to 1400); optimum at iteration 1006.]

    Analysis Techniques

    = Analysis and simulation techniques are essential for hardware/software codesign:

    0 To guide the design space exploration.

    0 To provide feedback to the human designers.

    0 To support design validation.

    S l ti f l i / i l ti t h i i


    Performance Metrics

    = Extreme case performance

    0 Worst-case execution time

    0 Best-case execution time

    = Average case performance

    = Probabilistic performance

    0 Used in soft real-time applications

    0 To accurately handle the variable execution time of tasks, which may be due to:

      - Application characteristics (e.g., data dependent loops);

      - Architectural factors (e.g., cache misses);

      - External factors (e.g., network load); or

      - Insufficient knowledge.

    0 To guarantee a high probability of meeting timing constraints.

    Simulation-based Techniques

    = Software: running the compiled program on the simulated target architecture.

    = Hardware: building a simulation model of the hardware and executing it to collect information.

    A very large number of inputs should be used


    Program Path Analysis

    = To determine what sequence of instructions will be executed in the worst case scenario.

    A basic block is composed of instructions in a straight line

    = Let us first assume that each instruction takes a fixed time to execute

    Program Path Analysis

    = Infeasible paths can be eliminated by data flow analysis and path information provided by the programmer.

    = The number of feasible paths is typically exponential with the program size.

    Efficient methods are needed to avoid


    ILP Formulation

    Let x_i be the number of times a basic block B_i is executed, and c_i be the execution time of the basic block B_i, which is assumed to be a constant.

    The total execution time of the program for a particular execution is:

    $$\sum_{i=1}^{N} c_i x_i$$

    [Figure: control-flow graph with basic-block execution times C1 to C7, annotated with execution counts.] For the execution shown, the total is:

    $$C_1 + C_2 + C_4 + 11\,C_5 + 10\,C_6 + C_7$$

    ILP Formulation (Cont'd)

    The estimated WCET of the program is:

    $$\max \sum_{i=1}^{N} c_i x_i$$

    subject to a set of constraints $Ax \le b$.


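    An Example

    /* k >= 0 */
    s = k;
    while (k < 10) {
        if (ok)
            j++;
        else {
            j = 0;
            ok = true;
        }
        k++;
    }
    r = j;

    [Figure: control-flow graph of the example. Basic block B1 (s = k;) has execution count x1 and the loop test has count x2; d1 and d2 label control-flow edges.]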


    Constraints II

    = Functionality constraints:

    Loop bound information:

    $$0 \cdot x_1 \le x_3 \le 10 \cdot x_1$$

    Path information:

    $$x_5 \le 1 \cdot x_1$$

    /* k >= 0 */
    s = k;                /* x1 */
    while (k < 10) {      /* x2 */
        if (ok)           /* x3 */
            j++;          /* x4 */
        else {
            j = 0;        /* x5 */
            ok = true;
        }
        k++;              /* x6 */
    }
    r = j;                /* x7 */
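To see how the constraints bound the worst case, the following sketch enumerates all feasible block counts for the example instead of calling an ILP solver. Several things here are assumptions: the execution times in `COST` are made-up values, and the structural equalities (for instance that the loop test runs once more than the loop body) are derived by hand from the control flow of the code rather than quoted from the slides.

```python
from typing import Dict, Tuple

# Hypothetical execution times c1..c7 of the basic blocks (not from the source).
COST: Dict[str, int] = {"x1": 2, "x2": 1, "x3": 1, "x4": 1, "x5": 3, "x6": 1, "x7": 1}

def counts(x3: int, x5: int) -> Dict[str, int]:
    """Derive every block count from the loop-body count x3 and else-branch count x5."""
    return {
        "x1": 1,          # s = k; runs once
        "x2": x3 + 1,     # loop test: once per iteration plus the final failing test
        "x3": x3,         # if (ok)
        "x4": x3 - x5,    # then-branch: j++
        "x5": x5,         # else-branch: j = 0; ok = true;
        "x6": x3,         # k++
        "x7": 1,          # r = j; runs once
    }

def wcet() -> Tuple[int, Dict[str, int]]:
    """Maximize sum(c_i * x_i) over all counts allowed by the constraints."""
    best: Tuple[int, Dict[str, int]] = (-1, {})
    for x3 in range(0, 11):                    # 0*x1 <= x3 <= 10*x1 with x1 = 1
        for x5 in range(0, min(1, x3) + 1):    # x5 <= 1*x1, and x5 cannot exceed x3
            x = counts(x3, x5)
            total = sum(COST[name] * x[name] for name in x)
            if total > best[0]:
                best = (total, x)
    return best

bound, worst_counts = wcet()
print(bound, worst_counts)
```

With these made-up costs the maximum is reached at ten loop iterations with the else-branch taken once. Without the path constraint on x5, the else-branch could be counted in every iteration and the bound would be needlessly pessimistic.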

    Remarks on Performance Analysis

    = One of the main issues of hardware/software codesign is estimation and analysis.

    = Analysis of average and probabilistic performance can be done by simulation.

    = Worst case execution time analysis can only be

    ffi i l d b i l i h i


    Simulation

    = Applied usually directly to the design descriptions, e.g. VHDL.

    = Can be used at different levels of abstraction:

    0 System

    0 Algorithmic

    0 Register-transfer

    0 Logic

    0 Gate

    0 Switch and circuit

    Co-Simulation

    = How can the hardware and software components be simulated at the same time?

    Problems:

    = Different simulation platforms are used;

    = Software runs fast while hardware simulation is


    Approaches to Co-Simulation 1

    = Gate-level model of the processor

    0 Gate-level simulation of the processor is very slow (tens of clock cycles/sec).

    Ex. At 10 cycles/sec, simulating a 1 GHz processor takes 100 million seconds (about 3.2 years) per second of real time.

    0 This provides a very accurate solution and is very simple from the co-simulation point of view.
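The arithmetic behind the example can be checked directly. The 365-day year is an assumption used only for the conversion.

```python
SIM_CYCLES_PER_SECOND = 10          # gate-level simulation speed from the example
TARGET_CLOCK_HZ = 1_000_000_000     # 1 GHz processor
SECONDS_PER_YEAR = 365 * 24 * 3600  # assuming a 365-day year

# One second of real time is 10^9 target cycles; at 10 simulated cycles per
# second of wall-clock time this takes 10^9 / 10 seconds.
sim_seconds = TARGET_CLOCK_HZ // SIM_CYCLES_PER_SECOND
sim_years = sim_seconds / SECONDS_PER_YEAR

print(sim_seconds)          # 100000000
print(round(sim_years, 1))  # 3.2
```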

    [Figure: co-simulation framework. The SW runs on a gate-level model of the processor (VHDL) under VHDL simulation, next to an ASIC model (VHDL) under VHDL simulation.]

    Approaches to Co-Simulation 2

    = Instruction-set architecture models

    [Figure: co-simulation framework. The SW runs on an ISA model (C program) executed as a program on the host, next to an ASIC model (VHDL) under VHDL simulation.]


    Hardware/Software Codesign Arch & Platf - 1


    Petru Eles, IDA, LiTH

    Architectures and Platforms

    1. Architecture Selection: The Basic Trade-Offs

    2. General Purpose vs. Application-Specific Processors

    3. Processor Specialisation

    4. ASIP Design Flow

    5. Specialisation of a VLIW ASIP

    6. Tool Support for Processor Specialisation

    7. Application Specific Platforms

    8. IP-Based Design (Design Reuse)

    9. Reconfigurable Systems

    Hardware/Software Codesign Arch & Platf - 2


    Remember the Design Flow

    [Figure: the overall design flow. Informal specification and constraints; modeling; system model (functional simulation, formal verification); architecture selection; system architecture; mapping; estimation; scheduling; mapped and scheduled model (simulation, formal verification), with "not OK" outcomes looping back and "OK" leading to the software model and hardware model; software generation and hardware synthesis; simulation.]


    Architecture Selection and Mapping

    Select the underlying hardware structure on which to run the modelled system.

    Map the functionality captured by the system over the components of the selected architecture. Functionality includes processing and communication.


    Architecture Selection

    General purpose vs. application specific:

    - Build a customised architecture strictly optimised for the particular application.

    - Use a general purpose, existing platform and map the application on it.

    - Or something in-between.

    Software vs. hardware:

    - Use programmable processors running software.

    - Use dedicated electronics (fixed or reconfigurable).

    - Or both.


    Architecture Selection (cont'd)

    The trade-offs:

    Performance (high speed, low power consumption)

    Flexibility (how easy it is to upgrade or modify)

    [Figure: two diagrams, one per trade-off, each placing application specific vs. general purpose and hardware vs. software on a scale from low to high, with reconfigurable hardware in between.]


    Architecture Selection (cont'd)

    [Figure: energy consumed versus flexibility (each from low through med. to high) for ASIC, FPGA, ASIP, and GP processor, with order-of-magnitude differences marked.]


    General Purpose vs. Application Specific Processors

    Both GP processors and ASIPs (application specific instruction set processors) can be RISCs, CISCs, DSPs, microcontrollers, etc.

    - One could look at DSPs and microcontrollers as being specific for DSP and simple control applications respectively.

    - An application specific DSP or microcontroller is, however, more specialised than just for DSP or control applications.

    GP processors

    - Neither instruction set nor microarchitecture or memory system are customised for a particular application or family of applications.

    ASIPs

    - Instruction set, microarchitecture and/or memory system are customised for an application or family of applications.

    - What results is better performance and reduced power consumption.


    What Makes an ASIP Specific?

    What can we specialize in a processor?

    Instruction set (IS) specialisation

    Exclude instructions which are not used

    - reduces instruction word length (fewer bits needed for encoding);

    - keeps controller and data path simple.

    Introduce instructions, even exotic ones, which are specific to the application: combinations of arithmetic instructions (multiply-accumulate), small algorithms (encoding/decoding, filter), vector operations, string manipulation or string matching, pixel operations, etc.

    - reduces code size -> reduced memory size, memory bandwidth, power consumption, execution time.


    What Makes an ASIP Specific?

    Function unit and data path specialisation

    Once an application specific IS is defined, this IS can be implemented using a more or less specific data path and more or less specific function units.

    Adaptation of word length.

    Adaptation of register number.

    Adaptation of functional units

    - Highly specialised functional units can be introduced for string matching and manipulation, pixel operations, arithmetic, and even complex units to perform certain sequences of computations (co-processors).


    Hardware/Software Codesign Arch & Platf - 11


    What Makes an ASIP Specific?

    Interconnect specialization

    Interconnect of functional modules and registers.

    Interconnect to memory and cache.

    - How many internal buses?

    - What kind of protocol?

    - Additional connections increase the potential of parallelism.

    Control specialisation

    Centralised control or distributed (globally asynchronous)?

    Pipelining?

    Out of order execution?

    Hardwired or microprogrammed?

    Hardware/Software Codesign Arch & Platf - 12


    ASIP Design Flow

    (It can be seen as a part of the big design flow - slide 2)

    [Figure: ASIP design flow with the blocks Algorithm(s), Compiler, Simulator, Processor Architecture, and Performance numbers.]

    Hardware/Software Codesign Arch & Platf - 13


    A SOC for Multimedia Applications

    [Figure: SoC block diagram with Glue logic, A/D and D/A, Controller (ASIP), On-chip memory, DSP (GP), and VLIW processor (ASIP).]

    The application specific Controller performs master control of the system and memory access control.

    The off-the-shelf (GP) DSP performs less computation intensive modem and sound codec functions.

    The VLIW ASIP performs computation intensive functions: discrete cosine and inverse discrete cosine transforms, motion estimation, etc.

    This is a typical application specific platform. Its structure has been adapted for a family of applications.

    Besides GP processor cores, the platform also consists of ASIP cores which themselves are specialised.


    Hardware/Software Codesign Arch & Platf - 15


    Specialization of a VLIW ASIP (contd)

    This is what an instruction word looks like:

    [Figure: instruction word with operation fields op1 ... op11, grouped into Cluster 1, Cluster 2, and Cluster 3.]

    Hardware/Software Codesign Arch & Platf - 16


    Specialization of a VLIW ASIP (contd)

    Traditionally the datapath is organised as a single register file shared by all functional units.

    Problem: Such a centralised structure does not scale!

    We increase the number of functional units in order to increase parallelism.

    We then have to increase the number of registers in the register file.

    Internal storage and communication between functional units and registers become dominant in terms of area, delay, and power.

    High performance VLIW processors are limited not by arithmetic capacity but by internal bandwidth.

    Hardware/Software Codesign Arch & Platf - 17


    Specialization of a VLIW ASIP (contd)

    A solution: clustering.

    Restrict the connectivity between functional units and registers, so that each functional unit can read/write from/to a subset of registers.

    Organise the datapath as clusters of functional units and local register files.

    Nothing is for free! Moving data between registers belonging to different clusters takes much time and power!

    You have to drastically minimise the number of such moves by:

    - Carefully adapting the structure of clusters to the application.

    - Using very clever compilers.
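
    To make the cost of clustering concrete, the sketch below counts inter-cluster moves for two assignments of operations to clusters. It is a simplification: every producer-to-consumer dependency that crosses a cluster boundary is counted as one move, and the dependency graph and both assignments are invented for illustration.

```python
def inter_cluster_moves(dependencies, cluster_of):
    """Count dependencies whose producer and consumer sit in different clusters."""
    return sum(
        1
        for producer, consumer in dependencies
        if cluster_of[producer] != cluster_of[consumer]
    )


# Hypothetical data-flow edges (producer, consumer) of a small kernel.
deps = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "e"), ("c", "e")]

# Two candidate assignments of operations to clusters 0 and 1.
naive = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0}
tuned = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}

naive_moves = inter_cluster_moves(deps, naive)
tuned_moves = inter_cluster_moves(deps, tuned)
```

    A compiler for a clustered VLIW would search for an assignment like `tuned`, subject to the functional units available in each cluster.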

    Hardware/Software Codesign Arch & Platf - 18


    Specialization of a VLIW ASIP (contd)

    Instruction set specialisation: nothing special.

    Function unit and data path specialisation

    - Determine the number of clusters.

    - For each cluster determine

    - the number and type of functional units;

    - the dimension of the register file.

    Memory specialisation is extremely important because we need to stream large amounts of data to the clusters at a high rate; one has to adapt the memory structure to the access characteristics of the application.

    - Determine the number and size of memory banks.

    Hardware/Software Codesign Arch & Platf - 19


    Specialization of a VLIW ASIP (contd)

    Interconnect specialization

    - Determine the interconnect structure between clusters and from clusters to memory:

    - one or several buses,

    - crossbar interconnection

    - etc.

    Control specialisation:

    That is more or less done, since we have decided on a VLIW processor.

    Hardware/Software Codesign Arch & Platf - 20


    Tool Support for Processor Specialisation

    Look at the design flow on slide 12!

    In order to be able to generate a specialised architecture you need:

    Retargetable compiler

    Configurable simulator

    Hardware/Software Codesign Arch & Platf - 21


    Retargetable Compiler

    [Figure: the Retargetable Compiler takes the Algorithm and the Processor Architecture description as input and produces Object code.]

    Hardware/Software Codesign Arch & Platf - 22


    Retargetable Compiler (contd)

    An automatically retargetable compiler can be used for a range of different target architectures.

    The actual code optimization and code generation is done by the compiler, based on a description of the target processor architecture. This description is formulated in a so-called architecture description language.

    Having a good compiler is not only important for the processor specialisation process!

    Once you have your specialised ASIP, you need a good compiler in order to make efficient use of it!

    Hardware/Software Codesign Arch & Platf - 23


    Configurable Simulator

    [Figure: the Simulator takes Object code and the Processor Architecture description as input and produces Performance numbers.]

    Such a simulator can be configured for a particular architecture (based on an architecture description).

    In this context, the most important output produced by the simulator is performance numbers:

    - throughput

    - delay

    - power/energy consumption
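
    The interplay of compiler and simulator in processor specialisation can be sketched as a small exploration loop. Everything below is a toy model under stated assumptions: `compile_for` and `simulate` are hypothetical stand-ins for a retargetable compiler and a configurable simulator, and the candidate architectures and their cost figures are invented.

```python
import math


def compile_for(arch, num_ops):
    """Stand-in for a retargetable compiler: returns a cycle count."""
    return math.ceil(num_ops / arch["units"])


def simulate(arch, cycles):
    """Stand-in for a configurable simulator: returns (delay_ns, energy)."""
    delay = cycles * arch["cycle_time_ns"]
    energy = cycles * arch["units"] * arch["energy_per_unit_cycle"]
    return delay, energy


def explore(candidates, num_ops, max_delay_ns):
    """Return (name, energy, delay) of the lowest-energy candidate that
    meets the delay bound, or None if no candidate meets it."""
    best = None
    for arch in candidates:
        delay, energy = simulate(arch, compile_for(arch, num_ops))
        if delay <= max_delay_ns and (best is None or energy < best[1]):
            best = (arch["name"], energy, delay)
    return best


candidates = [
    {"name": "1-unit", "units": 1, "cycle_time_ns": 2.0, "energy_per_unit_cycle": 1.0},
    {"name": "2-unit", "units": 2, "cycle_time_ns": 2.0, "energy_per_unit_cycle": 1.1},
    {"name": "4-unit", "units": 4, "cycle_time_ns": 2.5, "energy_per_unit_cycle": 1.3},
]

choice = explore(candidates, num_ops=100, max_delay_ns=150.0)
```

    With these invented numbers the loop rejects the single-unit design for missing the delay bound and prefers the two-unit design over the four-unit one on energy.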

    Hardware/Software Codesign Arch & Platf - 24

    Application Specific Platforms


    Not only processors but also hardware platforms can be specialised for classes of applications.

    The platform will define a certain communication infrastructure (buses and protocols), certain processor cores, peripherals, accelerators commonly used in the particular application area, and basic memory structure.

    Hardware/Software Codesign Arch & Platf - 25

    Application Specific Platforms (contd)


    [Figure: example platform with Proc. Core 1, Proc. Core 2, Proc. Core 3, Cache, DMA, Memory, Bridge, Reconfigurable logic, two Peripherals, a System bus, and a Peripheral bus.]

    Hardware/Software Codesign Arch & Platf - 26

    Application Specific Platforms (contd)


    Design space exploration for platform definition:

    [Figure: exploration loop with the blocks Applications, Mapping/Compiling, Simulator, Platform Architecture, and Performance numbers.]

    Hardware/Software Codesign Arch & Platf - 27

    Instantiating a Platform


    Once we have an application, the chip to implement it on will not be designed as a collection of independently developed blocks, but will be an instance of an application specific platform.

    The hardware platform will be refined by

    - determining memory and cache size

    - identifying the particular cores, peripherals to be used

    - adding specific ASICs, accelerators

    - determining the amount of reconfigurable logic (if needed)

    Hardware/Software Codesign Arch & Platf - 28

    Instantiating a Platform (contd)


    [Figure: instantiation loop with the blocks Application, Platform Architecture, Platform Instance, Mapping/Compiling, Simulator, and Performance numbers.]


    Hardware/Software Codesign Arch & Platf - 30

    IP-Based Design (Design Reuse)


    The key concept for increasing designer productivity is reuse.

    In order to manage the complexity of current large designs, we do not start from scratch but reuse as much as possible from previous designs, or use commercially available pre-designed IP blocks.

    IP: intellectual property.

    Some people call this IP-based design, core-based design, reuse techniques, etc.:

    Core-based design is the process of composing a new system design by reusing existing components.

    Hardware/Software Codesign Arch & Platf - 31

    IP-Based Design (contd)


    What are the blocks (cores) we reuse?

    Interfaces, encoders/decoders, filters, memories, timers, microcontroller cores, DSP cores, RISC cores, GP processor cores.

    Possible(!) definition

    A core is a design block which is larger than a typical RTL component.

    Of course: We also reuse software components!

    Hardware/Software Codesign Arch & Platf - 32

    IP-Based Design (contd)


    [Figure: Core 1, Core 2, Core 3, Core 4 (processor), and an I/O interface, taken from the libraries of Vendor A, Vendor B, and Vendor C, and attached through glue logic to an interconnection bus/switch.]

    What we have designed here can be:

    - An application specific SOC.

    - A platform to be further instantiated for a particular application.

    Hardware/Software Codesign Arch & Platf - 33

    Types of Cores


    Hard cores: fully designed, placed, and routed by the supplier.

    - A completely validated layout with definite timing.

    - Rapid integration; low flexibility.

    Firm cores: technology-mapped gate-level netlists.

    - Less predictability; flexibility during place and route.

    Hardware/Software Codesign Arch & Platf - 34

    Types of Cores (contd)


    Soft cores: synthesizable RTL or behavioral descriptions.

    - Maximal flexibility; much work with integration and verification.

    - Flexibility can provide opportunities such as adding application specific instructions to a processor core by modifying the behavioral description.

    Hardware/Software Codesign Arch & Platf - 35

    Reconfigurable Systems


    Programmable Hardware Circuits:

    They implement arbitrary combinational or sequential circuits and can be configured by loading a local memory that determines the interconnection among logic blocks.

    Reconfiguration can be applied an unlimited number of times.

    Main applications:

    - Software acceleration

    - Prototyping

    Hardware/Software Codesign Arch & Platf - 36

    Reconfigurable Systems (contd)


    Dynamic reconfiguration: spatial and temporal partitioning

    [Figure: a Processor and Memory coupled to an FPGA Accelerator; the blocks att1, att2, att3, and att4 are temporally partitioned on the FPGA.]

    Hardware/Software Codesign Arch & Platf - 37

    Reconfigurable Systems (contd)


    System on Chip with dynamically reconfigurable datapath

    [Figure: the chip contains a CPU, on-chip memory, and a reconfigurable datapath. Design flow blocks: C code, Profiling & Kernel extraction, Hw/Sw partitioning, Kernels, C code, Datapath synthesis.]

    Hardware/Software Codesign Arch & Platf - 38

    Summary


    Architecture selection is about making trade-offs along the dimensions of speed, cost, flexibility, and power consumption.

    ASIPs are programmable processors, specialised for a particular application or for a family of applications.

    Specialisation of an ASIP concerns instruction set, function units and data path, memory system, interconnect, and control.

    Two design tools are of great importance in order to perform processor specialisation: retargetable compiler and configurable simulator.

    Not only processors can be specialised but also platforms. A platform is specialised to execute a certain family of applications. The particular hardware to be used for a given application is a specialised instantiation of the platform.

    Hardware/Software Codesign Arch & Platf - 39

    Summary (contd)


    Reuse is a key technique in order to achieve high design productivity. Cores to be reused range from interfaces and decoders to filters and processors.

    The three types of cores (hard, firm, and soft) differ in their flexibility, predictability, and the effort needed for integration.

    Reconfigurable systems can provide good flexibility and, at the same time, many of the advantages of classical hardware implementation. They are mainly used for software acceleration and prototyping.

    Hardware/Software Codesign Low Power/Energy - 1

    System-Level Power/Energy Optimization


    1. Sources of Power Dissipation

    2. Reducing Power Consumption

    3. System Level Power Optimization

    4. Dynamic Power Management

    5. Mapping and Scheduling for Low Energy

    6. Real-Time Scheduling with Dynamic Voltage Scaling

    Hardware/Software Codesign Low Power/Energy - 2

    Remember the Design Flow


    [Figure: design flow with the blocks Informal Specification and Constraints; Modeling; System model (Functional Simulation, Formal Verification); Arch. Selection; System architecture; Mapping; Estimation; Scheduling; Mapped and scheduled model (Simulation, Formal Verification), with "not OK" feedback paths and an "OK" path to Softw. model and Hardw. model; Softw. Generation and Hardw. Synthesis; Simulation.]

    Hardware/Software Codesign Low Power/Energy - 3

    Why is Power Consumption an Issue?


    Portable systems - battery life time!

    Systems with a very limited power budget: Mars Pathfinder, autonomous helicopter, ...

    Desktops and servers: high power consumption

    - raises temperature and deteriorates performance & reliability

    - increases the need for expensive cooling mechanisms

    One of the main difficulties with developing high performance chips is heat extraction.

    High power consumption has economic and ecological consequences.

    Hardware/Software Codesign Low Power/Energy - 4

    Sources of Power Dissipation in CMOS Devices


    $P = \underbrace{\tfrac{1}{2} C V_{DD}^2 f N_{SW} + Q_{SC} V_{DD} f N_{SW}}_{\text{dynamic}} + \underbrace{I_{leak} V_{DD}}_{\text{static}}$

    - Switching power: power required to charge/discharge circuit nodes.

    - Short-circuit power: dissipation due to short-circuit current.

    - Leakage power: dissipation due to leakage current.

    C = node capacitances

    N_SW = switching activities (number of gate transitions per clock cycle)

    f = frequency of operation

    V_DD = supply voltage

    Q_SC = charge carried by short circuit current per transition

    I_leak = leakage current
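
    The three terms of the power equation can be evaluated with a short sketch. The function name `cmos_power` and the parameter values in the example are assumptions chosen only to exercise the formula; they do not describe any particular technology.

```python
def cmos_power(c, vdd, f, n_sw, q_sc, i_leak):
    """Return (switching, short_circuit, leakage) power in watts.

    c      -- node capacitance in farads
    vdd    -- supply voltage in volts
    f      -- clock frequency in hertz
    n_sw   -- switching activity (gate transitions per clock cycle)
    q_sc   -- charge carried by short-circuit current per transition, in coulombs
    i_leak -- leakage current in amperes
    """
    switching = 0.5 * c * vdd ** 2 * f * n_sw
    short_circuit = q_sc * vdd * f * n_sw
    leakage = i_leak * vdd
    return switching, short_circuit, leakage


# Invented example values.
p_sw, p_sc, p_leak = cmos_power(
    c=1e-9, vdd=1.2, f=1e8, n_sw=0.5, q_sc=1e-11, i_leak=1e-3
)
p_dynamic = p_sw + p_sc
p_total = p_dynamic + p_leak
```

    Only the switching term is quadratic in the supply voltage, which is why lowering V_DD is the most effective lever on dynamic power.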

    Hardware/Software Codesign Low Power/Energy - 5

    Sources of Power Dissipation in CMOS Devices (contd)


    [Figure: CMOS transistor (N-type) with gate, drain, source, and body terminals, and its circuit symbol.]

    Vbs = body bias voltage

    Vth = threshold voltage

    Threshold voltage:

    - The minimal voltage required at the gate to turn on the transistor.

    Hardware/Software Codesign Low Power/Energy - 6

    Sources of Power Dissipation in CMOS Devices (contd)


    [Figure: CMOS transistor (N-type) and a CMOS inverter that drives the output load capacitance CL from the supply Vdd.]

    Vbs = body bias voltage; Vth = threshold voltage; Vdd = supply voltage; CL = output load capacitance

    Dynamic power

    - Charging and discharging the output load capacitance

    - Momentary short circuits at a gate's output

    Hardware/Software Codesign Low Power/Energy - 7

    Sources of Power Dissipation in CMOS Devices (contd)


    [Figure: CMOS transistor (N-type) and a CMOS inverter that drives the output load capacitance CL from the supply Vdd.]

    Vbs = body bias voltage; Vth = threshold voltage; Vdd = supply voltage; CL = output load capacitance

    Static power

    - Subthreshold leakage conduction: it flows even when the voltage at the gate is below Vth.

    - Junction leakage (drain and source to body)

    Hardware/Software Codesign Low Power/Energy - 8

    Sources of Power Dissipation in CMOS Devices (contd)


    For a long time:

    Leakage power has been considered negligible compared to dynamic power.

    Today:

    Total dissipation from leakage is approaching the total from dynamic.

    As technology drops below 65nm:

    Leakage power is exceeding dynamic.

    Hardware/Software Codesign Low Power/Energy - 9

    Sources of Power Dissipation in CMOS Devices (contd)


    Leakage power is consumed even if the circuit is idle (standby). The only way to avoid it is to disconnect the circuit from the power supply.

    Short circuit power can be around 10% of total.

    Switching power is still the main source of power consumption.

    For the rest of the discussion, we consider mainly switching power. At the end we come back to leakage.

    Hardware/Software Codesign Low Power/Energy - 10

    Power and Energy Consumption


    $P = \tfrac{1}{2} C V_{DD}^2 f N_{SW}$

    $E = P \cdot t = \tfrac{1}{2} C V_{DD}^2 N_{CY} N_{SW}$, with $t = N_{CY} / f$

    N_CY = number of cycles needed for the particular task.

    In certain situations we are concerned about power consumption:

    - heat dissipation, cooling;

    - physical deterioration due to temperature.

    Sometimes we want to reduce total energy consumed:

    - battery life.
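
    The difference between power and energy can be checked numerically. The sketch below assumes only the switching-power model given above; the function names and the parameter values are illustrative.

```python
def switching_power(c, vdd, f, n_sw):
    """P = 1/2 * C * Vdd^2 * f * N_SW, in watts."""
    return 0.5 * c * vdd ** 2 * f * n_sw


def task_energy(c, vdd, f, n_sw, n_cy):
    """E = P * t, with t = N_CY / f, in joules."""
    return switching_power(c, vdd, f, n_sw) * (n_cy / f)


base = dict(c=1e-9, vdd=1.2, n_sw=0.5)

# Halving the clock frequency halves the power ...
p_fast = switching_power(f=2e8, **base)
p_slow = switching_power(f=1e8, **base)

# ... but leaves the energy of a fixed task unchanged.
e_fast = task_energy(f=2e8, n_cy=1_000_000, **base)
e_slow = task_energy(f=1e8, n_cy=1_000_000, **base)

# Halving the supply voltage cuts the energy to one quarter.
e_low_v = task_energy(c=1e-9, vdd=0.6, f=1e8, n_sw=0.5, n_cy=1_000_000)
```

    In this model, slowing the clock alone helps with heat but not with battery life; the energy saving comes from the lower supply voltage that a slower clock permits.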

    Hardware/Software Codesign Low Power/Energy - 11

    Reducing Power/Energy Consumption


    The main sources:

    Reduce supply voltage

    Reduce switching activity

    Reduce capacitance

    Reduce number of cycles

    Hardware/Software Codesign Low Power/Energy - 12

    Reducing Power/Energy Consumption (contd)


    Circuit level

    Ordering of transistors in gate (influences capacitance).

    Transistor sizing.

    Logic level

    Don't-care optimization to reduce switching activity.

    Reduce spurious switching activity by balancing the delays of paths that converge at each gate.

    Technology mapping.

    State encoding such that switching activity is minimised: if state s has a large number of transitions to state q, they should be given uni-distant codes.

    Encoding to minimise switching activity in arithmetic units or on the bus.

    Gated clocks: gate the clocks of circuits (registers, gates, arithmetic units) when they are in idle time periods.
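
    The state encoding rule from the list above can be illustrated by weighting the Hamming distance between state codes with the transition counts. The three-state machine, its transition counts, and both encodings are invented for this sketch.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def weighted_switching(encoding, transition_counts):
    """Total state-register bit flips for the given transition counts."""
    return sum(
        count * hamming(encoding[src], encoding[dst])
        for (src, dst), count in transition_counts.items()
    )


# Hypothetical profile: s and q alternate often, r is rarely visited.
transitions = {("s", "q"): 90, ("q", "s"): 90, ("s", "r"): 5, ("r", "q"): 5}

poor = {"s": 0b00, "q": 0b11, "r": 0b01}  # s and q differ in two bits
good = {"s": 0b00, "q": 0b01, "r": 0b11}  # s and q are uni-distant

poor_flips = weighted_switching(poor, transitions)
good_flips = weighted_switching(good, transitions)
```

    Giving the frequently alternating pair s and q codes at distance one roughly halves the register switching in this example.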

    Hardware/Software Codesign Low Power/Energy - 13

    Reducing Power/Energy Consumption (contd)


    Behavioral level

    Schedule and map operations so that the number of cycles is minimised (with an increased number of switchings per clock cycle); then you can run at a slower clock rate, and so you can reduce the supply voltage.

    Allocate and share modules so that power consumption is reduced (for example, by reducing switching activity).

    Hardware/Software Codesign Low Power/Energy - 14

    Reducing Power/Energy Consumption (contd)


    Architecture level

    Specialise instruction set, datapath, register structure to the particular architecture, with power consumption as an optimization goal.

    - You have on the chip, and you switch, only those resources (gates) you really need.

    Reduce power consumption on the bus.

    - lower switching activity: clever encoding, reduce switching activity on the address bus by exploiting correlations;

    - minimise the bus length (capacitance) by optimal module placement;

    - bus segmentation: transform a long, heavily loaded global bus into a partitioned set of local bus segments.
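
    One well-known way to exploit the correlation between consecutive addresses is Gray coding, in which successive values differ in exactly one bit. The text does not name a specific encoding, so Gray code is used here only as an illustrative choice; the sketch counts address bus bit flips for a sequential sweep over 16 addresses.

```python
def bit_flips(words):
    """Total number of bit transitions between consecutive bus words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)


addresses = list(range(16))  # sequential accesses, e.g. instruction fetch

binary_flips = bit_flips(addresses)
gray_flips = bit_flips([gray(a) for a in addresses])
```

    For purely sequential accesses the Gray-coded bus toggles one line per transfer, while plain binary toggles almost two on average; the benefit shrinks when accesses jump around.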

    Hardware/Software Codesign Low Power/Energy - 15

    Reducing Power/Energy Consumption (contd)


    Optimise the memory structure.

    - Memory transfers are extremely power hungry: a memory transfer takes 33 times more energy than an addition! Reducing the number of memory accesses is a very efficient way to save power!

    - Adapt the number of caches, their size and associativity, and the length of the cache line to the application, in order to reduce the number of memory transfers.

    - Interesting trade-off: larger caches consume more power but reduce the number of memory transfers; find the right balance!
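
    The cache size trade-off can be sketched with a toy energy model. Only the factor 33 comes from the text; the square-root laws for cache access energy and miss rate, and their constants, are assumptions made purely to show that an interior optimum can exist.

```python
import math

# Energy of one memory transfer relative to one addition (= 1.0), from the text.
MEMORY_TRANSFER_ENERGY = 33.0


def cache_access_energy(size_kb):
    """Assumed model: energy per cache access grows with sqrt(size)."""
    return 0.5 * math.sqrt(size_kb)


def miss_rate(size_kb):
    """Assumed model: miss rate falls with sqrt(size)."""
    return 0.2 / math.sqrt(size_kb)


def energy_per_access(size_kb):
    """Average energy per access: cache lookup plus the expected miss cost."""
    return cache_access_energy(size_kb) + miss_rate(size_kb) * MEMORY_TRANSFER_ENERGY


sizes_kb = [1, 2, 4, 8, 16, 32, 64]
best_size = min(sizes_kb, key=energy_per_access)
```

    In practice the miss rate curve would come from profiling the application, which is exactly the adaptation to the application that the slide asks for.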

    Hardware/Software Codesign Low Power/Energy - 16

    Reducing Power/Energy Consumption (contd)


    Provide instruction support for power management:

    - Instructions which allow certain parts of the system to be put in stand-by or shut down.

    - Instructions which allow the supply voltage to be set dynamically (dynamic voltage scaling).

    Hardware/Software Codesign Low Power/Energy - 17

    Reducing Power/Energy Consumption (contd)


    System Level

    Static techniques are applied at design time.

    - Compilation for low power: instruction selection considering their power profile, data placement in memory, register allocation.

    - Algorithm design: find the algorithm which is the most power-efficient.

    - Task mapping and scheduling.

    Dynamic techniques are applied at run time.

    - They reduce power consumption by exploiting idle or low-workload periods.

    Hardware/Software Codesign Low Power/Energy - 18

    System Level Power Optimization


    Three techniques will be discussed:

    1. Dynamic power management: a dynamic technique.

    2. Task mapping: a static technique.

    3. Task scheduling with dynamic power scaling: static & dynamic.

    Hardware/Software Codesign Low Power/Energy - 19

    Dynamic Power Management (DPM)


    [Figure: layered system with the application, the power aware OS, and the hardware.]

    Decisions:

    - Switching among multiple power states: idle, sleep, run.

    - Switching among multiple frequencies and voltage levels.

    Goal:

    - Energy optimization

    - QoS constraints satisfied

    Hardware/Software Codesign Low Power/Energy - 20

    Dynamic Power Management (contd)


    Hardware Support (e.g. Intel Xscale Processor)

    [Figure: power state machine. RUN operating points: 0.75 V, 150 MHz, 60 mW; 1.3 V, 600 MHz, 450 mW; 1.6 V, 800 MHz, 900 mW. IDLE: 40 mW. SLEEP: 160 µW. Transition times labelled on the arrows: 90 µs, 10 µs, 10 µs, 140 ms, 1.5 ms, 160 µs.]

    RUN: operational.

    IDLE: clocks to the CPU are disabled; recovery is through interrupt.

    SLEEP: mainly powered off; recovery through wake-up event.

    Other intermediate states: DEEP IDLE, STANDBY, DEEP SLEEP.
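
    The three RUN operating points quoted in the figure can be compared by their energy per clock cycle. The sketch below uses only the voltage, frequency, and power values from the slide; the helper name `energy_per_cycle_nj` is an assumption.

```python
# (supply voltage in V, clock in MHz, power in mW), as listed on the slide.
run_points = [
    (0.75, 150, 60),
    (1.3, 600, 450),
    (1.6, 800, 900),
]


def energy_per_cycle_nj(power_mw, freq_mhz):
    """Energy per clock cycle in nanojoules (mW / MHz = nJ)."""
    return power_mw / freq_mhz


per_cycle = [energy_per_cycle_nj(p, f) for _, f, p in run_points]
```

    The lowest-voltage point needs 0.4 nJ per cycle against 1.125 nJ at the fastest point, so a task that can tolerate the slower clock finishes with roughly a third of the energy.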

    Hardware/Software Codesign Low Power/Energy - 21

    Dynamic Power Management (contd)


    DPM techniques are used in laptops, personal digital assistants (PDAs), and other portable appliances in order to shut down or place in stand-by unused devices. The goal is power saving.

    DPM techniques are implemented in the operating system (including Windows 2000 running on laptops).

    The power breakdown for a laptop computer:

    - 36% of total power consumed by the display

    - 18% by hard-disk

    - 18% by wireless LAN interface

    - 7% by keyboard, mouse, etc.

    - 21% by digital VLSI circuits.

    Don't forget these!

    Hardware/Software Codesign Low Power/Energy - 22

    The Basic Concept of DPM


    When there are requests for a device, the device is busy; otherwise it is idle.

    When the device is idle, it can be shut down to enter a low-power sleeping state.

    [Figure: timeline with the marks T1, T2, T3, T4. Workload: Requests, no requests, Requests. Device state: Busy, Idle, Busy. Power state: Working, Sleeping, Working, with the shut-down delay Tsd and the wake-up delay Twu around the sleeping interval.]
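
    Whether a shut-down pays off depends on how long the device stays idle. A common way to reason about this, not spelled out in the text, is a break-even time: the idle length at which sleeping costs exactly as much energy as staying on. The sketch below assumes constant power in each state and lumps the shut-down and wake-up phases into one transition with total duration `t_transition` (Tsd + Twu) and total energy `e_transition`; all names and numbers are illustrative.

```python
def break_even_time(p_on, p_sleep, e_transition, t_transition):
    """Shortest idle period for which sleeping uses no more energy than staying on.

    Staying on costs p_on * T; sleeping costs
    e_transition + p_sleep * (T - t_transition). The idle period must also
    be long enough to fit the transitions themselves.
    """
    if p_on <= p_sleep:
        raise ValueError("sleep state must draw less power than the on state")
    t_energy = (e_transition - p_sleep * t_transition) / (p_on - p_sleep)
    return max(t_transition, t_energy)


def should_shut_down(idle_period, p_on, p_sleep, e_transition, t_transition):
    """True if shutting down for this idle period saves energy."""
    return idle_period > break_even_time(p_on, p_sleep, e_transition, t_transition)


# Invented device: 1 W when on, 0.1 W asleep, 2 J and 0.5 s per sleep cycle.
t_be = break_even_time(p_on=1.0, p_sleep=0.1, e_transition=2.0, t_transition=0.5)
```

    Since the idle period is not known in advance, a power manager has to estimate it, which is what the policies discussed next are about.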


    Hardware/Software Codesign Low Power/Energy - 24

    Power Management Policies


    Power management policies are concerned with predictions related to idle periods:

    - For shut-down: try to predict how long the idle period will be in order to decide if a shut-down should be performed.

    - For wake-up: try to predict when the idle period ends, in order to avoid user delays due to Twu. This is quite difficult, and often the wake-up is started simply when a request has arrived.

    Typical Policies:

    1. Time-out

    2. Predictive

    3. Stochastic


    Hardware/Software Codesign Low Power/Energy - 26

    Predictive Policy

    The length of an idle period is predicted. If the prediction is for an idle period long enough, the shut-down is performed immediately (no time interval T1 - T2 on slide 16).

    Policy

    - L-shaped distribution of Idle Period versus Previous Busy Period:

    [Figure: plot of Idle Period against Busy Period.]

    - Short busy periods are followed by long idle periods.

    - Busy periods longer than a threshold are followed by short idle periods.

    Shut down after a short busy period!
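
    The threshold rule above is simple enough to sketch directly. The trace of busy and idle periods, the threshold, and the break-even value used for scoring are invented; `predictive_shutdown` and `evaluate` are hypothetical helper names.

```python
def predictive_shutdown(busy_periods, threshold):
    """One decision per busy period: True means shut down right after it."""
    return [busy < threshold for busy in busy_periods]


def evaluate(busy_periods, idle_periods, threshold, break_even):
    """Count (good, bad) shut-downs.

    A shut-down is good if the idle period that followed was longer than
    the break-even time, and bad otherwise.
    """
    good = bad = 0
    decisions = predictive_shutdown(busy_periods, threshold)
    for shut_down, idle in zip(decisions, idle_periods):
        if shut_down:
            if idle > break_even:
                good += 1
            else:
                bad += 1
    return good, bad


# Hypothetical trace: each busy period is followed by the idle period
# at the same index (arbitrary time units).
busy = [1, 8, 2, 9, 1]
idle = [20, 2, 15, 1, 3]

decisions = predictive_shutdown(busy, threshold=5)
score = evaluate(busy, idle, threshold=5, break_even=5)
```

    The last entry of the trace shows the failure mode: a short busy period followed by a short idle period leads to a shut-down that does not pay off.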

    Hardware/Software Codesign Low Power/Energy - 27

    Stochastic Policy

    Predictions are based on Markov models: requests and power state transitions of the device are modelled as probabilistic state machines.

    The power manager observes the arriving requests, the request queue, and the device, and generates shutdown commands.

    [Figure: the environment or user (Markov model: request generator) generates requests into the request queue; the device (Markov model: device) provides service; the Power manager observes (obs.) the request generator, the request queue, and the device, and issues commands.]

    Hardware/Software Codesign Low Power/Energy - 28

    Mapping and Scheduling for Low Energy

    For many embedded systems, DPM techniques like those presented before cannot be applied:

    - They have no devices like a hard disk and no (or a small) display, so VLSI is a main source of power dissipation.

    - They have time constraints: we have to keep deadlines (usually we cannot afford shut-down and wake-up times).

    - The operating system is small: no sophisticated techniques at run-time.

    - The application is known at design time, so we know a lot about the application already at design time.

    Static techniques can be used (applied at design time). Mapping and scheduling for low energy are important!

    Hardware/Software Codesign Low Power/Energy - 29

    Mapping for Low Energy

    [Figure: task graph with tasks 1 to 8; architecture with processors p3 and p4 connected by a bus.]

    Task   WCET        Energy
           p3    p4    p3    p4
    1       5     6     5     3
    2       7     9     8     4
    3       5     6     5     3
    4       8    10     6     4
    5      10    11     8     6
    6      17    21    15    10
    7      10    14     8     7
    8      15    19    14     9

    Hardware/Software Codesign Low Power/Energy - 30

    Mapping for Low Energy (contd)

    Consider a mapping:

    p3: 1, 3, 6, 7, 8.
    p4: 2, 4, 5.

    Communication times and energy:

    C1-2: t = 1; E = 3.
    C3-5: t = 2; E = 5.
    C4-8: t = 1; E = 3.
    C5-7: t = 1; E = 3.

    Execution time: 52; Energy consumed: 75.

    [Figure: Gantt chart of this schedule for p3, p4, and the bus (C1-2, C3-5, C4-8, C5-7); time axis from 0 to 64.]

    Hardware/Software Codesign Low Power/Energy - 31

    Mapping for Low Energy (contd)

    Consider a mapping:

    p3: 1, 3, 6, 7.
    p4: 2, 4, 5, 8.

    Communication times and energy:

    C1-2: t = 1; E = 3.
    C3-5: t = 2; E = 5.
    C7-8: t = 1; E = 3.
    C5-7: t = 1; E = 3.

    Execution time: 57; Energy consumed: 70.

    [Figure: Gantt chart of this schedule for p3, p4, and the bus (C1-2, C3-5, C5-7, C7-8); time axis from 0 to 64.]

    Hardware/Software Codesign Low Power/Energy - 32

    Mapping for Low Energy (contd)

    The second mapping, with task 8 on p4, consumes less energy (70 instead of 75).

    Assume that we have a maximum allowed delay of 60.

    The second mapping is preferable, even though it is slower (57 instead of 52), because it still meets the delay constraint!
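The energy figures of the two mappings can be re-derived from the task table and the communication costs given earlier. The sketch below is only a bookkeeping aid: the function name `mapping_energy` and the data layout are assumptions for this example, and the execution times (52 and 57) are taken from the slides rather than computed, since they depend on the task graph and the schedule.

```python
# Energy of each task on each processor (from the WCET/energy table).
TASK_ENERGY = {
    1: {"p3": 5, "p4": 3},
    2: {"p3": 8, "p4": 4},
    3: {"p3": 5, "p4": 3},
    4: {"p3": 6, "p4": 4},
    5: {"p3": 8, "p4": 6},
    6: {"p3": 15, "p4": 10},
    7: {"p3": 8, "p4": 7},
    8: {"p3": 14, "p4": 9},
}

# Energy of each bus communication that occurs in one of the two mappings.
COMM_ENERGY = {"C1-2": 3, "C3-5": 5, "C4-8": 3, "C5-7": 3, "C7-8": 3}


def mapping_energy(mapping, communications):
    """Sum the task energies for a mapping plus the bus communication energy."""
    tasks = sum(
        TASK_ENERGY[task][proc] for proc, assigned in mapping.items() for task in assigned
    )
    bus = sum(COMM_ENERGY[c] for c in communications)
    return tasks + bus


first_energy = mapping_energy(
    {"p3": [1, 3, 6, 7, 8], "p4": [2, 4, 5]}, ["C1-2", "C3-5", "C4-8", "C5-7"]
)
second_energy = mapping_energy(
    {"p3": [1, 3, 6, 7], "p4": [2, 4, 5, 8]}, ["C1-2", "C3-5", "C7-8", "C5-7"]
)

# Pick the lowest-energy mapping among those that respect the maximum delay.
MAX_DELAY = 60
candidates = [("first", 52, first_energy), ("second", 57, second_energy)]
feasible = [c for c in candidates if c[1] <= MAX_DELAY]
best = min(feasible, key=lambda c: c[2])
```

With a tighter delay limit (for example 55) only the first mapping would remain feasible, and the choice would flip.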

    Hardware/Software Codesign Low Power/Energy - 34

    Real-Time Scheduling with Dynamic Voltage Scaling (contd)

    The scheduling problem:

    Which task to execute at a certain moment on a certain processor so that time constraints are fulfilled?

    The scheduling problem with voltage scaling:

    Which task to execute at a certain moment on a certain processor, and at which voltage level, so that time constraints are fulfilled and energy consumption is minimised?

    The problem: reducing supply voltage extends execution time!

    Hardware/Software Codesign Low Power/Energy - 35

    Variable Voltage Processors

    Several supply voltage levels are available.

    Supply voltage can be fixed by the application (operating system) through execution of particular instructions.

    Frequency is automatically adjusted to the current supply voltage.

    Several processors with variable voltage levels are already available. There will be more and more in the near future.

    Hardware/Software Codesign Low Power/Energy - 36

    The Basic Principle

    We consider a single task τ:

    - total computation: 10^9 execution cycles.

    - deadline: 25 seconds.

    - processor nominal (maximum) voltage: 5 V.

    - energy: 40 nJ/cycle at nominal voltage.

    - processor speed: 50 MHz (50 × 10^6 cycles/sec) at nominal voltage.

    [Figure: V^2 versus time (0 to 25 sec). The task runs all 10^9 cycles at 5 V (V^2 = 5^2): t_exe = 20 sec, E_total = 40 J; the remaining 5 sec before the deadline are slack.]

    Hardware/Software Codesign Low Power/Energy - 37

    The Basic Principle (contd)

    Let's make it slower!

    V_DD = 2.5 V:

    - energy: 40 × 2.5^2/5^2 = 10 nJ/cycle.

    - speed: 50 × 2.5/5 = 25 MHz.

    [Figure: V^2 versus time (0 to 25 sec). The task runs 750 × 10^6 cycles at 5 V (V^2 = 5^2) and 250 × 10^6 cycles at 2.5 V (V^2 = 2.5^2): t_exe = 25 sec, E_total = 32.5 J.]

    Hardware/Software Codesign Low Power/Energy - 38

    The Basic Principle (contd)

    V_DD = 4 V:

    - energy: 40 × 4^2/5^2 = 25.6 nJ/cycle, taken as 25 nJ/cycle in what follows.

    - speed: 50 × 4/5 = 40 MHz.

    [Figure: V^2 versus time (0 to 25 sec). The task runs all 10^9 cycles at 4 V (V^2 = 4^2): t_exe = 25 sec, E_total = 25 J.]

    Hardware/Software Codesign Low Power/Energy - 39

    The Basic Principle (contd)

    If a processor uses a single supply voltage and completes a program just on deadline, the energy consumption is minimised.
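The three schedules discussed for the single task can be checked numerically. The sketch below assumes, as the preceding slides do, that speed scales linearly with the supply voltage and that energy per cycle scales with its square; the function name `run` and the schedule format are made up for this illustration.

```python
NOMINAL_VOLTAGE = 5.0      # V
NOMINAL_ENERGY_NJ = 40.0   # nJ per cycle at the nominal voltage
NOMINAL_SPEED_MHZ = 50.0   # million cycles per second at the nominal voltage


def run(schedule):
    """Return (time in s, energy in J) for a list of
    (voltage in V, millions of cycles) segments."""
    total_time = 0.0
    total_energy = 0.0
    for voltage, mcycles in schedule:
        scale = voltage / NOMINAL_VOLTAGE
        speed_mhz = NOMINAL_SPEED_MHZ * scale        # speed grows linearly with V
        energy_nj = NOMINAL_ENERGY_NJ * scale ** 2   # energy per cycle grows with V^2
        total_time += mcycles / speed_mhz
        total_energy += mcycles * energy_nj / 1000.0  # 1e6 cycles * 1e-9 J/nJ
    return total_time, total_energy


full_speed = run([(5.0, 1000)])             # finishes early, leaves slack
two_levels = run([(5.0, 750), (2.5, 250)])  # meets the deadline with two voltages
single_level = run([(4.0, 1000)])           # meets the deadline with one voltage
```

Under these assumptions the single-voltage schedule gives about 25.6 J (the slides work with the rounded 25 nJ/cycle, hence 25 J), compared with 32.5 J for the two-level schedule and 40 J at full speed.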

    Consider two tasks τ1, τ2:

    Computation:

    - τ1: 250 × 10^6 execution cycles; τ2: 750 × 10^6 execution cycles.

    Deadline: 25 seconds.

    Processor nominal (maximum) voltage: 5 V.

    Energy:

    - 40 nJ/cycle at nominal voltage.

    - 25 nJ/cycle at V_DD = 4 V.

    Processor speed:

    - 50 MHz (50 × 10^6 cycles/sec) at nominal voltage.

    - 40 MHz at V_DD = 4 V.

    [Figure: task graph with tasks τ1 and τ2.]

    Hardware/Software Codesign Low Power/Energy - 40

    The Basic Principle (contd)

    Find the voltage so that the tasks just meet their deadline → you have minimised energy consumption!

    [Figure: V^2 versus time (0 to 25 sec). τ1 (250 × 10^6 cycles) and τ2 (750 × 10^6 cycles) both run at 4 V (V^2 = 4^2) and finish at the deadline; E_total = 25 J.]

    Hardware/Software Codesign Low Power/Energy - 41

    Considering Task Particularities

    Energy consumed by a task:

    $$E = \frac{1}{2} \cdot C \cdot V_{DD}^2 \cdot N_{CY} \cdot N_{SW}$$

    N_SW = number of gate transitions per clock cycle.

    C = switched capacitance per clock cycle.

    N_CY = number of clock cycles executed by the task.

    Average energy consumed by a task per cycle:

    $$E_{CY} = \frac{1}{2} \cdot C \cdot V_{DD}^2 \cdot N_{SW}$$

    Often tasks differ from each other in terms of executed operations → N_SW and C differ from one task to the other.

    The average energy consumed per cycle differs from task to task.

    Hardware/Software Codesign Low Power/Energy - 42

    Considering Task Particularities (contd)

    Consider two tasks τ1, τ2:

    Computation:

    - τ1: 250 × 10^6 execution cycles; τ2: 750 × 10^6 execution cycles.

    Deadline: 25 seconds.

    Processor nominal (maximum) voltage: 5 V.

    Processor speed:

    - 50 MHz (50 × 10^6 cycles/sec) at nominal voltage.

    - 40 MHz at V_DD = 4 V.

    - 25 MHz at V_DD = 2.5 V.

    Energy τ1:

    - 50 nJ/cycle at V_DD = 5 V.

    - 32 nJ/cycle at V_DD = 4 V.

    - 12.5 nJ/cycle at V_DD = 2.5 V.

    Energy τ2:

    - 12.5 nJ/cycle at V_DD = 5 V.

    - 8 nJ/cycle at V_DD = 4 V.

    - 3 nJ/cycle at V_DD = 2.5 V.

    [Figure: task graph with tasks τ1 and τ2.]

    Hardware/Software Codesign Low Power/Energy - 43

    Considering Task Particularities (contd)

    Here we have a solution with V_DD = 4 V for both tasks, and the deadline just fulfilled:

    E_total = 32 nJ/cycle × 250 × 10^6 cycles + 8 nJ/cycle × 750 × 10^6 cycles = 14 J

    [Figure: V^2 versus time (0 to 25 sec). τ1 (250 × 10^6 cycles) and τ2 (750 × 10^6 cycles) both run at 4 V (V^2 = 4^2); E_total = 14 J.]

    Hardware/Software Codesign Low Power/Energy - 44

    Considering Task Particularities (contd)

    Here we run τ1 at V_DD = 2.5 V, and τ2 at V_DD = 5 V; the tasks finish just on deadline.

    E_total = 12.5 nJ/cycle × 250 × 10^6 cycles + 12.5 nJ/cycle × 750 × 10^6 cycles = 12.5 J

    [Figure: V^2 versus time (0 to 25 sec). τ1 (250 × 10^6 cycles) runs at 2.5 V (V^2 = 2.5^2) and τ2 (750 × 10^6 cycles) runs at 5 V (V^2 = 5^2); E_total = 12.5 J.]

    Hardware/Software Codesign Low Power/Energy - 45

    Considering Task Particularities (contd)

    If power consumption per cycle is not constant (but differs from task to task), the rule on slide 33 is not true any more.

    Voltage levels have to be reduced with priority for those tasks which have a larger energy consumption per cycle.

    One particular voltage level has to be established for each task, so that deadlines are just satisfied.
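With only three voltage levels and two tasks, the best assignment can be found by simply trying all combinations. The sketch below does this for the two-task example above. It is an exhaustive search chosen for illustration, not a method proposed in the slides, and names such as `evaluate` and the task keys `t1`/`t2` are assumptions.

```python
from itertools import product

# Processor speed (MHz) at each available supply voltage (V).
SPEED_MHZ = {5.0: 50, 4.0: 40, 2.5: 25}

# Energy per cycle (nJ) of each task at each supply voltage.
ENERGY_NJ = {
    "t1": {5.0: 50, 4.0: 32, 2.5: 12.5},
    "t2": {5.0: 12.5, 4.0: 8, 2.5: 3},
}

MCYCLES = {"t1": 250, "t2": 750}  # millions of execution cycles
DEADLINE_S = 25


def evaluate(v1, v2):
    """Return (total time in s, total energy in J) when t1 runs at v1 and t2 at v2."""
    voltages = {"t1": v1, "t2": v2}
    time_s = sum(MCYCLES[t] / SPEED_MHZ[v] for t, v in voltages.items())
    energy_j = sum(MCYCLES[t] * ENERGY_NJ[t][v] / 1000 for t, v in voltages.items())
    return time_s, energy_j


# Keep every voltage pair that meets the deadline, then take the cheapest one.
feasible = []
for v1, v2 in product(SPEED_MHZ, repeat=2):
    time_s, energy_j = evaluate(v1, v2)
    if time_s <= DEADLINE_S:
        feasible.append((energy_j, v1, v2))

best_energy, best_v1, best_v2 = min(feasible)
```

The search confirms the observation on the slides: lowering the voltage of the task with the larger energy per cycle (τ1) pays off more than running both tasks at the same voltage.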

    Hardware/Software Codesign Low Power/Energy - 46

    Discrete Voltage Levels

    Practical microprocessors can work only at a finite number of discrete voltage levels.

    The ideal voltage V_ideal, determined for a certain task, does not exist (it is not one of the available levels).

    A task is supposed to run for time t_exe at the voltage V_ideal.

    On the particular processor, the two closest available neighbours to V_ideal are V_1 and V_2, with V_1 < V_ideal < V_2.

    You have minimised the energy if you run the task for time t_1 at voltage V_1 and for time t_2 at voltage V_2, so that t_1 + t_2 = t_exe.
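The split into t_1 and t_2 follows from two conditions: the two intervals must add up to t_exe, and together they must execute all cycles of the task. A minimal sketch, assuming speeds f_1 and f_2 (in MHz) at V_1 and V_2; the function name `split_between_levels` is hypothetical. The numbers reuse the single task from "The Basic Principle" (10^9 cycles, 25 seconds) and assume that only 2.5 V (25 MHz) and 5 V (50 MHz) are available, while the ideal level would be 4 V.

```python
def split_between_levels(mcycles, t_exe, f1_mhz, f2_mhz):
    """Return (t1, t2): time at the lower level (speed f1) and at the higher
    level (speed f2), such that t1 + t2 = t_exe and
    t1 * f1 + t2 * f2 = mcycles (millions of cycles)."""
    t2 = (mcycles - f1_mhz * t_exe) / (f2_mhz - f1_mhz)
    t1 = t_exe - t2
    return t1, t2


# 10^9 cycles in 25 s, using 2.5 V (25 MHz) and 5 V (50 MHz).
t1, t2 = split_between_levels(1000, 25, 25, 50)
```

This gives 10 seconds at 2.5 V and 15 seconds at 5 V, which is the two-level schedule (250 × 10^6 and 750 × 10^6 cycles) shown earlier for the single task.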

    Hardware/Software Codesign Low Power/Energy - 48

    The Pitfalls with Ignoring Leakage

    $$E = N_C \cdot C_{eff} \cdot V_{dd}^2 + L_g \cdot (V_{dd} \cdot K_3 \cdot e^{K_4 \cdot V_{dd}} \cdot e^{K_5 \cdot V_{bs}} + |V_{bs}| \cdot I_{ju}) \cdot t$$

    Minimise this (the dynamic term) and ignore the rest (the leakage term)!

    Hardware/Software Codesign Low Power/Energy - 49

    The Pitfalls with Ignoring Leakage

    $$E = N_C \cdot C_{eff} \cdot V_{dd}^2 + L_g \cdot (V_{dd} \cdot K_3 \cdot e^{K_4 \cdot V_{dd}} \cdot e^{K_5 \cdot V_{bs}} + |V_{bs}| \cdot I_{ju}) \cdot t$$

    1. We don't optimize global energy but only a part of it!

    2. We can even get it very wrong and increase energy consumption!

    Leakage decreases with Vdd, but grows with time!

    Dynamic decreases with Vdd regardless of increased time.

    Hardware/Software Codesign Low Power/Energy - 50

    $$E = N_C \cdot C_{eff} \cdot V_{dd}^2 + L_g \cdot (V_{dd} \cdot K_3 \cdot e^{K_4 \cdot V_{dd}} \cdot e^{K_5 \cdot V_{bs}} + |V_{bs}| \cdot I_{ju}) \cdot t$$

    [Figure: energy per cycle (0 to 8e-10) versus Vdd (0.5 to 1), showing the dynamic energy curve. 70nm technology, Crusoe processor; Jejurikar et al., DAC'04.]

    Hardware/Software Codesign Low Power/Energy - 51

    $$E = N_C \cdot C_{eff} \cdot V_{dd}^2 + L_g \cdot (V_{dd} \cdot K_3 \cdot e^{K_4 \cdot V_{dd}} \cdot e^{K_5 \cdot V_{bs}} + |V_{bs}| \cdot I_{ju}) \cdot t$$

    [Figure: energy per cycle (0 to 8e-10) versus Vdd (0.5 to 1), showing the leakage energy and dynamic energy curves. 70nm technology, Crusoe processor; Jejurikar et al., DAC'04.]

    Hardware/Software Codesign Low Power/Energy - 52

    $$E = N_C \cdot C_{eff} \cdot V_{dd}^2 + L_g \cdot (V_{dd} \cdot K_3 \cdot e^{K_4 \cdot V_{dd}} \cdot e^{K_5 \cdot V_{bs}} + |V_{bs}| \cdot I_{ju}) \cdot t$$

    [Figure: energy per cycle (0 to 8e-10) versus Vdd (0.5 to 1), showing the leakage energy, dynamic energy, and dynamic + leakage curves, with the critical point marked on the total curve. 70nm technology; Jejurikar et al., DAC'04.]

    Critical point! If you go beyond this point with Vdd, energy grows.

    Hardware/Software Codesign Low Power/Energy - 53

    Summary

    Power consumption becomes a central issue for embedded systems design.

    Power/energy consumption can be reduced by reducing supply voltage, switching activity, switched capacitance, and the number of executed cycles.

    There are means at all levels of the design to reduce power consumption: circuit, logic, behavioral, architecture, system level.

    At system level we distinguish dynamic techniques (applied during run-time) and static techniques (applied at design time).

    Hardware/Software Codesign Low Power/Energy - 54

    Summary (contd)

    Dynamic power management is implemented by the operating system, and is mainly used in portable appliances to shut down or place in stand-by unused devices.

    Typical policies for power management are: time-out, predictive, and stochastic.

    Both at task mapping and at scheduling, design decisions can be made which have a huge impact on power/energy consumption.

    Real-time scheduling in the context of processors with voltage scaling is extremely interesting. The main trade-off is voltage level vs. execution time. One has to find the optimal voltage levels such that energy consumption is reduced and deadlines are still fulfilled.