Optimization-centric embedded systems design: addressing ...lia.disi.unibo.it/Courses/AI/applicationsAI2007-2008/Lucidi/Seminari/... · Optimization-centric embedded systems design:

Optimization-centric embedded systems design: addressing the multi-core challenge

L. Benini, DEIS Università di Bologna

Embedded Applications Pull

[Figure (Philips/IMEC): embedded application roadmap by year of introduction (2005 to 2015) against required efficiency (5 GOPS/W, 100 GOPS/W, 1 TOPS/W). Applications shown: sign recognition, A/V streaming, adaptive route, collision avoidance, autonomous driving, 3D projected display, HMI by motion, gesture detection, ubiquitous navigation, Si Xray, Gbit radio, UWB, 802.11n, structured encoding, structured decoding, 3D TV, 3D gaming, H264 encoding, H264 decoding, image recognition, full recognition (security), auto personalization, dictation, 3D ambient interaction, language, emotion, gesture and expression recognition, mobile base-band.]

Moore’s Law Revisited

[Figure [Martin06]: projection for 2005 to 2020 of total logic size and total memory size (both normalized to 2005, left axis) and of the number of processing engines (right axis). Data labels on the chart, one per year: 16, 23, 32, 46, 63, 79, 101, 133, 161, 212, 268, 348, 424, 526, 669, 878. Also labelled on the slide: Amarasinghe 06; Rapport Kilocore.]

The Multicore Revolution

• “Traditional” bus-based SoCs fit in one tile!
• Extreme NRE cost: complexity, unwieldy technology
• High risk, large investment: decreasing design starts
• Flexibility is key: programmability, configurability
• Resource management (computation, communication, storage) is essential to achieve efficiency
  • Allocation and scheduling of NUMEROUS, HETEROGENEOUS resources
  • Market success hinges on optimal exploitation of available resources

Heterogeneous Multicore

[Figure: heterogeneous multicore platform with processing elements (PE), CPU, local memory hierarchy, SRAM, DRAM, 3D stacked main memory, I/O and peripherals. Software layers: applications; software optimization; middleware, RTOS, API, run-time controller; mapping; VDD, VT, fclk.]

Resource Management IS Optimization

• Search space: the set of “all” possible allocations and schedules
• Constraints: solutions that we are not willing to accept
• Cost function: a property we are interested in (execution time, power, reliability…)

Many approaches to optimal RM:
• off-line vs. online
• complete (exact) vs. heuristic (approximate)
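The three ingredients above can be illustrated with a minimal sketch of a complete (exact) search. Everything in it is an assumption made for illustration: the four tasks, their execution times and memory needs, the two processors, the memory budget, and the choice of "largest processor load" as the cost function.

```c
#include <assert.h>
#include <limits.h>

#define N_TASKS 4
#define N_PROCS 2

/* Hypothetical toy instance: per-task execution time and memory footprint. */
static const int exec_time[N_TASKS] = {4, 3, 2, 5};
static const int mem_need[N_TASKS]  = {2, 2, 1, 3};
#define MEM_CAPACITY 5   /* assumed per-processor memory budget */

/* Cost function: the largest total execution time on any processor.
 * Returns -1 when the allocation violates the memory constraint. */
int allocation_cost(unsigned mask)
{
    int load[N_PROCS] = {0}, mem[N_PROCS] = {0};
    for (int t = 0; t < N_TASKS; t++) {
        int p = (int)((mask >> t) & 1u);   /* bit t selects the processor of task t */
        load[p] += exec_time[t];
        mem[p]  += mem_need[t];
    }
    for (int p = 0; p < N_PROCS; p++)
        if (mem[p] > MEM_CAPACITY)
            return -1;                     /* constraint: solution not acceptable */
    return load[0] > load[1] ? load[0] : load[1];
}

/* Complete (exact) search: visit the whole search space, keep the best.
 * Returns INT_MAX if no allocation satisfies the constraint. */
int best_allocation(unsigned *best_mask)
{
    int best = INT_MAX;
    assert(best_mask != 0);
    for (unsigned mask = 0; mask < (1u << N_TASKS); mask++) {
        int c = allocation_cost(mask);
        if (c >= 0 && c < best) {
            best = c;
            *best_mask = mask;
        }
    }
    return best;
}
```

Enumeration is only viable for tiny instances; it is shown here to make the roles of search space, constraints and cost function concrete, not as a practical solver.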

When & Why Complete, Offline Optimization?

• Plenty of design-time knowledge
  • Applications pre-characterized at design time
  • Dynamic transitions between different pre-characterized scenarios
• Aggressive exploitation of system resources
  • Reduces overdesign (lowers cost)
  • Strong performance guarantees

Applicable for many embedded applications

Optimization-centric RM in Practice

[Figure: optimization-centric flow. Modeling: the application (programming model, task graph T1 to T8) and the MPSoC system (PPE with PPU and storage subsystem; SPE 0 to SPE 7, each with SPU, local storage, MFC with DMA, MMU, bus interface and synchronization; Element Interconnect Bus (EIB); DRAM memory) are abstracted into models. Optimization: allocation + scheduling, shown as a chart of resources over time with a deadline. Implementation: execution on the platform. Three gaps are highlighted: the Abstraction Gap (modeling), the Optimality Gap (optimization) and the Execution Gap (implementation).]

Bridging the Abstraction Gap

Target architecture

• Homogeneous computation tiles:
  • ARM cores (including instruction and data caches);
  • Tightly coupled software-controlled scratch-pad memories (SPM);
  • AMBA AHB;
  • DMA engine;
  • RTEMS OS;
  • Power models for 0.13 μm
• Variable voltage/frequency cores with discrete (Vdd, f) pairs
  • Frequency dividers scale down the system clock
  • Voltage switching overhead is modeled
• Cores use non-cacheable shared memory to communicate
• Semaphore and interrupt facilities are used for synchronization

[Figure: target platform block diagram (SystemC). Tiles, each with a synchronization module and a private memory, plus a shared memory and an interrupt slave, connected by the AMBA AHB interconnect; a clock tree generator with programmable registers produces CLOCK 1 to CLOCK N.]

Model accuracy is essential: handle abstraction with care!

Bus modelling

Bus congestion causes performance unpredictability.

• Unary model
• Coarse-grain additive model

Millions of bus transactions: model blowup!

Characterize the range of applicability of the model:
• Virtual platform;
• Force the system to work under the additive regime.

[Figure: bus bandwidth over time for tasks t1 and t2 against the maximum bandwidth, under the unary model and under the coarse-grain additive model (achievable bandwidth).]

Tuning of the additive model

• Requesting more than 65% of the maximum bandwidth causes the additive model to fail;
• Benefits of working in the additive regime:
  • Task execution time almost independent of bus utilization;
  • Performance predictability greatly enhanced.
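The 65% threshold can be turned into a simple admission check. The sketch below is only an illustration: the function name, the use of summed per-task bandwidth requests, and the units are assumptions, not part of the original tool flow.

```c
#include <assert.h>
#include <stddef.h>

/* Threshold reported on the slide: above 65% of the maximum bus bandwidth
 * the additive model stops being accurate. */
#define ADDITIVE_LIMIT 0.65

/* Returns 1 when the summed bandwidth requests of the concurrently active
 * tasks stay within the additive regime, 0 otherwise.
 * All values use the same (arbitrary) bandwidth unit. */
int within_additive_regime(const double *requested_bw, size_t n_tasks,
                           double max_bw)
{
    double total = 0.0;
    assert(max_bw > 0.0);
    for (size_t i = 0; i < n_tasks; i++)
        total += requested_bw[i];
    return total <= ADDITIVE_LIMIT * max_bw;
}
```

An optimizer could use such a check as a constraint, so that every schedule it produces keeps the bus in the regime where the coarse-grain model was validated.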

[Figure: plot with vertical axis “Task Execution Time”.]

Application model

• Task graph
  • Task dependencies
  • Execution times expressed in clock cycles: WCN(Ti)
  • Communication times (writes & reads) expressed as: WCN(WTiTj) and WCN(RTiTj)
  • These values can be back-annotated from functional simulation or computed using WCET analysis tools (e.g. AbsInt)
• Node type
  • Normal; Fork, And; Branch, Or

[Figure: example task graph (Task1 to Task6) annotated with WCN(Ti), WCN(WTiTj) and WCN(RTiTj); each task consists of a reading phase, an execution phase and a writing phase. Platform: two tiles (#1, #2), each with an ARM core, interrupt controller, SPM, semaphores and private memory, connected by the system bus.]

Task memory requirements

Each task has three kinds of memory requirements:
• Program data;
• Internal state;
• Communication queues.

Task storage can be allocated by the optimizer:
• On the local SPM
• On the remote private memory

Communicating tasks might run:
• On the same processor → negligible communication cost
• On different processors → costly message exchange procedure


Application Development Flow

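[Figure: application development flow. Calibration phase: the CTG is characterized with the simulator/WCET analysis, producing application profiles. Optimization phase: the optimizer computes allocation and scheduling. Application development support: optimal SW application implementation. Platform execution.]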

Validation of optimizer solutions: Throughput

Optimizer → optimal allocation & schedule → virtual platform validation (250 instances)

• MAX error lower than 10%
• AVG error equal to 4.51%, with standard deviation of 1.94
• All deadlines are met

[Figure: histogram over 250 instances; horizontal axis: throughput difference (%), from -5% to 11%; vertical axis: probability (%).]

Validation of optimizer solutions: Power

Optimizer → optimal allocation & schedule → virtual platform validation (250 instances)

• MAX error lower than 10%;
• AVG error equal to 4.80%, with standard deviation of 1.71.

[Figure: histogram over 250 instances; horizontal axis: energy consumption difference (%), from -5% to 11%; vertical axis: probability (%).]

Bridging the Optimality Gap

The Optimization Challenge

The problem of allocating and scheduling task graphs on multiprocessors in a distributed real-time system is NP-hard.

[Figure: task graph T1 to T8 mapped onto Proc. 1 to Proc. N, each with a private memory, connected by an interconnect. Allocation: assignment of tasks to processors. Schedule: chart of resources over time, with a deadline.]

Scheduling & Voltage Scaling

• Mapping and scheduling: given (fastest freq.)
• Different voltages: different frequencies
• Energy/speed trade-offs: varying the voltages
• Voltage and frequency scaling make the problem even harder!
• Current off-line approaches solve mapping, scheduling and voltage selection separately (sequentially)

[Figure: power over time for tasks τ1, τ2, τ3 against the deadline, first at the fastest frequency (leaving slack), then with frequencies f1, f2, f3; the CPU is supplied with Vdd and Vbs.]

Optimization framework

• Deterministic & stochastic task graphs
• Constraints
  • Resources: computation, communication, storage
  • Timing: task deadlines, makespan
• Objective functions
  • Performance (e.g. makespan)
  • Power (energy)
  • Bus utilization

General modeling framework: highly unstructured optimization problems
• No black-box/generic optimizer can solve them efficiently
• We developed a flexible algorithmic framework which is tuned on specific problems

Logic Based Benders Decomposition

Obj. function: e.g. system energy. In general, it depends on MP & SP variables.

• Master problem (MP): allocation & freq. assignment, INTEGER PROGRAMMING
  • Resource constraints
  • SP relaxation constraint (e.g. the sum of ExecT on a proc. does not exceed the deadline)
  • Output: an allocation, LB for cost
• Subproblem (SP): scheduling, CONSTRAINT PROGRAMMING
  • Timing constraint
  • Obj. function: minimize freq. switching overheads
  • Output: allocation + schedule, UB for cost
• Cuts sent back from the SP to the MP:
  • No-good: infeasible SP
  • Optimality cut: the SP solution is optimal UNLESS a better one exists with a different allocation

Iterations stop when the MP becomes infeasible!
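The master/subproblem interaction can be sketched on a toy instance. This is a deliberately simplified variant, not the authors' solver: the master problem is solved by enumeration instead of integer programming, the subproblem is a trivial chain schedule instead of a CP model, and the cost (energy) depends only on master variables, so only no-good cuts are needed and the first schedulable master solution is optimal. All data values and function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define N_TASKS  4
#define N_MASKS  (1u << N_TASKS)   /* 2 processors: bit t = processor of task t */
#define DEADLINE 14
#define COMM     2                 /* delay when consecutive tasks change processor */

/* Hypothetical data: [task][processor]; processor 0 is fast but power hungry. */
static const int dur[N_TASKS][2]    = {{2, 4}, {3, 5}, {2, 4}, {3, 6}};
static const int energy[N_TASKS][2] = {{8, 3}, {9, 4}, {8, 3}, {10, 4}};

static int proc_of(unsigned mask, int t) { return (int)((mask >> t) & 1u); }

/* Master problem: cheapest allocation that is not cut off and satisfies the
 * subproblem relaxation (load on each processor within the deadline).
 * Returns the energy, or -1 when the master problem is infeasible. */
static int solve_master(const int *cut, unsigned *chosen)
{
    int best = INT_MAX;
    for (unsigned mask = 0; mask < N_MASKS; mask++) {
        int load[2] = {0, 0}, e = 0;
        if (cut[mask]) continue;
        for (int t = 0; t < N_TASKS; t++) {
            load[proc_of(mask, t)] += dur[t][proc_of(mask, t)];
            e += energy[t][proc_of(mask, t)];
        }
        if (load[0] > DEADLINE || load[1] > DEADLINE) continue;
        if (e < best) { best = e; *chosen = mask; }
    }
    return best == INT_MAX ? -1 : best;
}

/* Subproblem: schedule the chain T0 -> T1 -> T2 -> T3 for a fixed allocation
 * and report whether it meets the deadline. */
static int schedule_is_feasible(unsigned mask)
{
    int finish = 0;
    for (int t = 0; t < N_TASKS; t++) {
        if (t > 0 && proc_of(mask, t) != proc_of(mask, t - 1))
            finish += COMM;
        finish += dur[t][proc_of(mask, t)];
    }
    return finish <= DEADLINE;
}

/* Benders loop: returns the optimal energy (-1 if none) and counts the cuts. */
int benders(unsigned *allocation, int *n_cuts)
{
    int cut[N_MASKS] = {0};
    assert(allocation != 0 && n_cuts != 0);
    *n_cuts = 0;
    for (;;) {
        int e = solve_master(cut, allocation);
        if (e < 0) return -1;                  /* master infeasible: no solution */
        if (schedule_is_feasible(*allocation)) return e;
        cut[*allocation] = 1;                  /* no-good cut */
        (*n_cuts)++;
    }
}
```

When the subproblem also contributes to the cost, as with the frequency switching overheads on the slide, the loop would instead record the best schedule found so far, add an optimality cut, and continue until the master problem becomes infeasible.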

Computational scalability

Deterministic task graphs, mapping & scheduling

• Simplified CP and IP formulations
• Hybrid approach clearly outperforms pure CP and IP techniques
• Search time bounded to 1000 sec.
  • CP and IP could find a solution in at most 50% of the instances
  • Hybrid approach always found a solution

Computational Scalability

• Hundreds of decision variables
• Much beyond ILP solver or CP solver capability

[Figures: deterministic task graphs, mapping & scheduling & v,f selection; stochastic task graphs, mapping & scheduling & min bus usage.]

Does it Matter?

• Throughput required: 1 frame/10 ms.
• With 2 processors and 4 possible frequency & voltage settings
• Task graph:
  • 10 computational tasks;
  • 15 communication tasks.

Without optimizations: 50.9 μJ
With optimizations: 17.1 μJ (-66.4%)
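The quoted saving follows directly from the two energy figures:

```latex
\[
\frac{50.9\,\mu\mathrm{J} - 17.1\,\mu\mathrm{J}}{50.9\,\mu\mathrm{J}}
  = \frac{33.8}{50.9} \approx 0.664 = 66.4\%
\]
```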

Optimality gap

Comparison with heuristic 2-phase solution (GA)

[Figure: optimality gap against the heuristic, annotated “timing barrier”.]

The gap is significant when constraints are tight.

Bridging the Execution Gap

Programming environment

A software development toolkit to help programmers in software implementation:
• a generic customizable application template (OFFLINE SUPPORT);
• a set of high-level APIs (ONLINE SUPPORT in RT-OS).

The main goals are:
• predictable application execution after the optimization step;
• guarantees on high performance and constraint satisfaction.

Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure.
Programmers can intuitively translate the high-level representation into C code using our facilities and library.

Example

• Number of nodes (e.g. 12)
• Graph of activities
• Node type: Normal, Branch, Conditional, Terminator
• Node behaviour: Or, And, Fork, Branch
• Number of CPUs: 2
• Task allocation
• Task scheduling
• Arc priorities

#define TASK_NUMBER 12
#define N_CPU 2

//Node Type: 0 NORMAL; 1 BRANCH ; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};

//Node Behaviour: 0 AND ; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};

uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};

uint queue_consumer [..] [..] = {{0,1,1,0,..},{0,0,0,1,1,.},{0,0,0,0,0,1,1..},{0,0,0,0,..}..};

[Figure: example graph of 12 nodes (N1, B2, B3, C4 to C7, N8 to N11, T12) with fork, or, and, and branch nodes and arcs a1 to a14; resulting schedule on processors P1 and P2, shown as resources over time with a deadline.]

Queue ordering optimization

Communication ordering affects system performance.

[Figure: tasks T1 to T6 on CPU1 and CPU2 exchanging communications C1 to C5; depending on the order in which the communications are issued, a consumer task must wait (“Wait!”) or can run (“RUN!”).]

Synchronization among tasks

Non-blocking semaphores

[Figure: tasks T1 to T4 with communications C1 to C3, scheduled on Proc. 1 and Proc. 2; T4 is suspended and later re-activated.]

Putting it all together
CELLFLOW: Targeting the Cell BE

Project goal: develop a high-level programming environment with optimal allocation and scheduling for the CELL SDK

Cell BE Processor Architecture

Heterogeneous, distributed memory multiprocessor with explicit DMA over a ring-NoC

[Figure: Cell BE block diagram. SPE0 to SPE6, each with a local store (LS, 256 KB) and DMA; PPE with L1 (32 KB I/D) and L2 (512 KB); MIC (Memory Interface Controller) to XIO; Flex-IO0 and Flex-IO1 I/O interfaces.]

Focus on SPEs:
1. Statically scheduled, non-preemptive
2. Explicit allocation and de-allocation of local store

Task & Application Models

• Task is split into 3 phases:
  • Reading input queues
  • Task execution
  • Writing output queues
• Tasks communicate through queues
  • FIFO buffering
  • Semaphore synchronization
• No task preemption

[Figure: reading phase, exec. phase, writing phase.]

void Main_task(int id, void *input, void *output, void *state)
{
    //Initialization
    Init_task_structures();
    Init_queues();
    …
    //Task Core
    //Reading phase
    Read_input();
    ….
    //Task Execution
    Exec();
    ….
    //Writing phase
    Write_output();
}
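The three-phase structure can be made concrete with a small, self-contained sketch. The `fifo_t` type, `fifo_push`, `fifo_pop` and `task_step` are hypothetical stand-ins invented for this illustration; they are not the framework's API, and semaphore synchronization is replaced by simple emptiness/fullness checks.

```c
#include <assert.h>

#define QUEUE_CAPACITY 8

/* Hypothetical in-memory FIFO standing in for the framework queues. */
typedef struct {
    int data[QUEUE_CAPACITY];
    unsigned head, tail, count;
} fifo_t;

static int fifo_push(fifo_t *q, int v)
{
    if (q->count == QUEUE_CAPACITY) return 0;   /* full */
    q->data[q->tail] = v;
    q->tail = (q->tail + 1) % QUEUE_CAPACITY;
    q->count++;
    return 1;
}

static int fifo_pop(fifo_t *q, int *v)
{
    if (q->count == 0) return 0;                /* empty */
    *v = q->data[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return 1;
}

/* One non-preemptive activation of a task: returns 1 if it ran,
 * 0 if it could not (missing input or full output queue). */
int task_step(fifo_t *input, fifo_t *output, int *state)
{
    int token;
    assert(input && output && state);
    if (input->count == 0 || output->count == QUEUE_CAPACITY) return 0;
    /* Reading phase */
    fifo_pop(input, &token);
    /* Execution phase: placeholder computation, accumulates into the state */
    *state += token;
    /* Writing phase */
    fifo_push(output, *state);
    return 1;
}
```

Checking both queues before the reading phase keeps the activation atomic: the task either completes all three phases or does not start, which matches the no-preemption assumption.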

Framework

• Number of nodes: 12
• Graph of activities
• Number of CPUs: 2
• Allocation
• Scheduling

#define TASK_NUMBER 12
#define N_CPU 2
uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};
uint queue_consumer [..] [..] = {{0,1,1,0,..},{0,0,0,1,1,.},{0,0,0,0,0,1,1..},{0,0,0,0,..}..};

Eclipse plugin. Available resource management engines:
• optimum static schedulers based on integrated IP and CP solvers (Benders decomposition)
• fast & suboptimal list-based scheduler for dynamic scheduling

Run-time support:
• Allocation;
• Scheduling;
• Communication;
• Synchronization.

[Figure: framework components: GUI, application template, resources optimizer, mapping API, communication API, synchronization API, run-time support; tasks T1 to T8 mapped onto two CPUs (P1, P2) with memories on an interconnect.]

On-line Support

• Automatic application configuration:
  • Mailbox system
  • Task generation and configuration
  • Buffer allocations
• No OS on SPEs: local scheduling support
• Local memory limitation: code overlays

[Figure: PPE and SPEs. Main memory holds queue info, task info (Task 0 to Task N), allocation info and physical addresses; queues are allocated from this information. Each SPE local storage holds root segment 0 with the scheduler and segment 1 with the overlaid tasks (Task 0, Task 1, Task 2).]

Multi-stage Benders Decomposition

• When the SCHED problem is solved, one or more cuts (A) are generated to forbid the current memory device allocation, and the process is restarted from the MEM stage;
  • if the scheduling problem is feasible, an upper bound on the value of the next solution is also posted.
• When the MEM & SCHED sub-problem ends, more cuts (B) are generated to forbid the current task-to-SPE assignment.
• When the SPE stage becomes infeasible, the process is over, having converged to the optimal solution for the overall problem.

[Figure: three-stage loop SPE Alloc, MEM Alloc, SCHED, with cuts (A) and (B) fed back to the earlier stages.]

Goal: minimize makespan

SPE Allocation

Given a graph with n tasks, m arcs and a platform with p processing elements, with T_ij = 1 if task i is assigned to PE j:

\begin{align*}
\min\ & z \\
\text{s.t.}\ & z \ge \sum_{i=0}^{n-1} T_{ij} && \forall j = 0,\dots,p-1 \\
& \sum_{j=0}^{p-1} T_{ij} = 1 && \forall i = 0,\dots,n-1 \\
& T_{ij} \in \{0,1\} && \forall i = 0,\dots,n-1;\ \forall j = 0,\dots,p-1 \\
& \sum_{i=0}^{n-1} DMIN(i)\, T_{ij} \le dline && \forall j = 0,\dots,p-1
\end{align*}

• Objective: the makespan objective function depends only on scheduling decision variables, so we adopt a heuristic objective function: spread tasks as much as possible over different SPEs, which often provides good makespan values pretty quickly.
• First constraint: needed to express the objective function. It forces the objective variable z to be at least the total number of tasks allocated on any PE.
• Second constraint: each task can be assigned to a single PE.
• Last constraint: bounds the total duration of tasks on a single SPE, to discard trivially infeasible solutions. DMIN(i) is the minimum possible duration of task i and dline is a deadline.
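As a hedged illustration of the SPE allocation model, the sketch below evaluates a given task-to-PE assignment: it computes the heuristic objective z (the largest number of tasks on any PE) and rejects assignments whose summed minimum durations exceed the deadline on some PE. The function name and the array-based encoding of the assignment are assumptions made for this example.

```c
#include <assert.h>

/* Evaluates a task-to-PE assignment against the SPE allocation model.
 * pe_of[i] is the PE of task i (so each task is on exactly one PE),
 * dmin[i] is its minimum possible duration. Returns z, the largest number
 * of tasks on any PE, or -1 if the summed minimum durations on some PE
 * exceed dline. */
int spe_allocation_objective(const int *pe_of, const int *dmin,
                             int n, int p, int dline)
{
    int z = 0;
    assert(n >= 0 && p > 0);
    for (int j = 0; j < p; j++) {
        int count = 0, total_dmin = 0;
        for (int i = 0; i < n; i++) {
            assert(pe_of[i] >= 0 && pe_of[i] < p);
            if (pe_of[i] == j) {          /* corresponds to T_ij = 1 */
                count++;
                total_dmin += dmin[i];
            }
        }
        if (total_dmin > dline) return -1; /* trivially infeasible */
        if (count > z) z = count;
    }
    return z;
}
```

A solver would search for the assignment minimizing this value; the function only scores one candidate.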

Schedulability test

SPE allocation choices are by themselves very relevant: a bad SPE assignment is sufficient to make the scheduling problem infeasible.

If the given allocation with minimal task durations is already infeasible for the scheduling component, then it is useless to complete it with a memory assignment, which cannot lead to any feasible solution overall.

[Figure: the SPE Alloc, MEM Alloc, SCHED loop extended with a SCHED Test stage.]

Memory device allocation

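For each arc a_r = (t_h, t_k), with producer task t_h and consumer task t_k:

\begin{align*}
& M_i \in \{0,1\} && \forall i = 0,\dots,n-1 \\
& W_r \in \{0,1\} && \forall r = 0,\dots,m-1 \\
& R_r \in \{0,1\} && \forall r = 0,\dots,m-1 \\
& W_r + R_r \le 1 && \text{if } pe(h) \ne pe(k) \\
& W_r = R_r && \text{if } pe(h) = pe(k) \\
& base\_usage(j) = \sum_{pe(i)=j} mem(i)\,M_i
  + \sum_{a_r=(t_h,t_k),\ pe(h)=j} comm(r)\,W_r
  + \sum_{a_r=(t_h,t_k),\ pe(k)=j} comm(r)\,R_r \\
& base\_usage(j) + (1-M_i)\,mem(i)
  + \sum_{a_r=(t_h,t_i)} (1-R_r)\,comm(r)
  + \sum_{a_r=(t_i,t_k)} (1-W_r)\,comm(r) \le C
  && \forall j = 0,\dots,p-1;\ \forall t_i \text{ s.t. } pe(i)=j
\end{align*}

• M_i = 1 if task i allocates its computation data on the local memory of the SPE it is assigned to.
• W_r = 1 if the communication buffer is on SPE pe(h) (that of the producer).
• R_r = 1 if the buffer is on SPE pe(k) (that of the consumer).

[Figure: producer (Pr) and consumer (Cn) buffer placement cases: Wr=1, Rr=0; Wr=0, Rr=1; Wr=0, Rr=0; Wr=1, Rr=1; Wr=0, Rr=0.]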


mem(i) is the amount of memory required to store the internal data of task i; comm(r) is the size of the communication buffer associated with arc r. The base_usage(j) expression is the amount of memory needed to store all data permanently allocated on the local device of processor j.

Scheduling subproblem

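• One activity for each:
  • execution phase (exec)
  • buffer reading/writing operation (rd, wr).
• Tasks are not preemptive.

Reading the slide notation as follows (an interpretation): for task t_i, buffers r_0, …, r_{h-1} are read and buffers r_h, …, r_{k-1} are written.

\begin{align*}
& end(rd_{r_l}) = start(rd_{r_{l+1}}) && \forall l = 0,\dots,h-2 \\
& end(rd_{r_{h-1}}) = start(exec_i), \quad end(exec_i) = start(wr_{r_h}) \\
& end(wr_{r_l}) = start(wr_{r_{l+1}}) && \forall l = h,\dots,k-2 \\
& end(wr_r) \le start(rd_r) && \forall r = 0,\dots,m-1
\end{align*}

The last constraint states that each communication buffer must be written before it can be read.

[Figure: activity chains Rd, Rd, Rd, Exec, Wr, Wr and Rd, Exec, Wr, Wr.]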

Exact vs. Heuristic Scheduler

• Heuristic: RR resource allocation + list scheduling
• Up to 40% makespan difference
• 15% on average

[Figure: example of exact vs. heuristic Gantt charts.]

Only a combination of allocation + scheduling achieves optimality.
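For reference, a heuristic of the kind used as the baseline (round-robin allocation followed by list scheduling) can be sketched as below. This is an illustrative version under stated assumptions, not the scheduler that was benchmarked: tasks are assumed to be numbered in topological order, the priority list is simply that numbering, and communication costs are ignored.

```c
#include <assert.h>

#define MAX_TASKS 16
#define MAX_PROCS 8

/* Round-robin allocation followed by list scheduling.
 * Tasks must be numbered in topological order (every predecessor has a
 * smaller index); dep[i][j] != 0 means task j must finish before task i
 * starts. Fills pe_of[] and start[], and returns the makespan. */
int rr_list_schedule(int n, int p, const int *dur,
                     int dep[][MAX_TASKS], int *pe_of, int *start)
{
    int proc_free[MAX_PROCS] = {0};
    int makespan = 0;
    assert(n > 0 && n <= MAX_TASKS && p > 0 && p <= MAX_PROCS);
    for (int i = 0; i < n; i++) {
        int ready = 0;
        pe_of[i] = i % p;                       /* round-robin allocation */
        for (int j = 0; j < i; j++)             /* wait for predecessors */
            if (dep[i][j] && start[j] + dur[j] > ready)
                ready = start[j] + dur[j];
        /* start as soon as both the data and the processor are available */
        start[i] = ready > proc_free[pe_of[i]] ? ready : proc_free[pe_of[i]];
        proc_free[pe_of[i]] = start[i] + dur[i];
        if (proc_free[pe_of[i]] > makespan)
            makespan = proc_free[pe_of[i]];
    }
    return makespan;
}
```

Because the allocation is fixed before any timing information is considered, the result can be far from the optimum, which is the gap the slide quantifies.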

TD vs pure-CP

[Figures: set of instances where task durations depend on allocation decisions; set of instances where task durations do not depend on allocation decisions.]

TD vs BD

• Up to the 20-21 group, TD is much more efficient than BD. Starting from group 22-23, the high number of timed-out instances biases the average execution time. TD does considerably better until group 24-25.
• After that, most instances are not solved within the time limit by any of the approaches.
• TD has a lower execution time, even though it generally performs more iterations than BD:
  • TD works by solving many easy sub-problems;
  • BD performs fewer and slower iterations.

Cellflow: Benchmark Results

[Figure: speedup vs. number of SPEs (0 to 7) for Mat-mult, SW Radio and FFT, compared with the theoretical limit.]

• The Mat-mult benchmark scales almost perfectly:
  • efficiency of the runtime environment
  • almost negligible overhead.
• Good speed-ups also for the FFT;
• The software radio benchmark shows good speedup only up to three SPEs:
  • A critical path limits the performance boost
  • Functional pipelining can help

Real-life applications

RT Radar Processing

[Figure: processing chain FFT, FDT, iFFT between the Disp and Coll stages, on 1 SPE and replicated on N SPEs.]

[Figures: average time (μs) and speedups, and single block computation time (μs), on 1, 2, 4 and 6 SPUs. Legend: DMA_PUT, IFFT, CVE-CVML, FFT, DMA_GET, wait (“attesa”).]

Ongoing & Future Work

• Extensions
  • Scheduling parallel DMA activity
  • Robust schedules with non-WCET execution
  • Full dataflow (FIFO buffers) support
• Performance tuning of the middleware
• Interaction with high-level tools:
  • Parallelization tools
  • Data distribution tools
  • Model-based environments
• Dynamic resource management
  • Hybrid approaches (offline + online)
• Integration with advanced architectures
  • Predictable (QoS) communication protocols
  • Multihop (NoC) interconnects

Computational Efficiency: distribution of the allocation/scheduling time ratio for the solvers

[Figures: TD solved, TD unsolved, BD solved, BD unsolved.]

The solution time for instances not solved within the limit appears to be strongly unbalanced, with most of the time absorbed by the scheduling component.

For the BD solver, substantially all the process time is spent in solving allocation subproblems.