Methodology and Tools for Performance Analysis of ...Methodology and Tools for Performance Analysis of Multiprocessor Embedded Systems Ismail Assayad and Sergio Yovine Verimag, France

Methodology and Tools for Performance Analysis of

Multiprocessor Embedded Systems

Ismail Assayad and Sergio Yovine

Verimag, France

ARTIST WS 2007, Berlin

ARTIST WS 2007 – Modelling and Performance Analysis – p.1/22

Presentation plan• Motivation

• Current practices

• Related work

• Methodology

• Framework

• Applications• Video encoding platform application

• Conclusion


Motivation

• Many embedded applications such as video compression,HDTV and packet routing require higher and higherperformance =⇒

1. Hardware becomes multiprocessor

2. Software becomes parallel

• Significant growth in the demand and workload ofembedded architectures =⇒

1. Need to be able to predict the mutual impact of software and hardware on

their performance

2. Framework supporting joint (rather than separate) software and hardware

analysis


Current practices• Modelling approaches:

• Analytical models

• Simulation• Wrapper-based external timing models• Annotated timing models

• Modelling scope:• Software-based

• Analysis of hardware performance is not considered

• Hardware-based• Focuses on hardware design without taking into account software

development or analysis

• Platform-based design• Abstraction levels for modelling both software and hardware

architectures


Related work• Analytical (Thiele et al):

• Specific to the application domain of packet processing

• Network calculus theory for reasoning about interleaved streams of packets

• MPARM (Benini et al)• Wrapper-based

• Multiprocessor simulation platform for analysis of hardware design tradeoffs

• ARTS (Madsen et al):• Annotated DAG for software and communication latencies for hardware

• Software timing models are not resolved to micro-architectures ones

• Metropolis (Vincentelli et al), MESH (Paul et al):• General-purpose approaches for concurrency modelling at unfixed

abstraction level

• SystemC/TLM (Ghenassia et al):• Industrial practices use wrapper-based models (PV, PVT) but does not

propose a method


Our approach• Our approach is a simulation and platform based one, it

provides:• Methodology for concurrency and performance modelling of

micro-architectures

• Modelling hardware at a transaction level and software at a task level

• Wrapper-less annotated-method based timed models for components

• Advantages:• Semantics and methodology for components construction, connexion, and

performance prediction

• Joint software and hardware model-based performance analysis support

• An implementation of our framework is built usingSystemC and TLM


Methodology

SW Model

SW Model

SW Model

HW Model

Mapping

Translation

Simulation

Constraints Synthesis

Code Generation

SW components HW components

Simulation

Engine

user

user

Executable




• Related work

• Methodology

• Framework• Modelling• Hardware meta-models• Example• Software meta-models• Tools

• Application

• Conclusion


Framework - Modelling• Components

• They model transactions behavior of the system

• They communicate through transaction requests and state-change based

events

• They support profiling of predicted performance (eg. used bandwidth,

conflicts, execution and communication times, etc)

4

5

CHANNEL

3

1 ME

MO

RY

2

DATA BUS

writ

e da

ta

write cmdwrite data

CMD BUS

writ

e da

ta

write cmd

write(@,32B,data)

Consumer Producer

P2

P1

notify

write cmd

(HW components)

(SW components)


Framework - Hardware meta-models• Component meta-models

• can be instantiated for modelling hardware micro-architectures at

transaction-level

• are performance-centric and take into account arbitration policies,

transaction-level latencies and generated transaction request traffic

• are composed of following blocks:

Controller

TLTB

Tra

nsac

tor

Arbiter

TR

TR

Buffer

OK

OK

Buffered component


Framework - Hardware meta-models• Component behavior

• Blocks are described by automata whose states are instantiated with

hardware specific functions

• Examples: some variables and functions associated to “Running”and

“Executing” states of controller and transaction automata are hardware

dependent

ReadyStalled

Running

[Not

T i.E

nabl

ed]

[Ti.

Ena

bled

]!Ti .F

ire

?req

Controller automaton


Framework - DRAM example

conditional

operation133

1

133

1ba

nk 2

bank

1

T21

conflict-group

C

CB

BA

A

C

CT22

T11

T12

TLTB block model of a two-bank DRAM (instance of the meta-model)

• In the“Running”state of the controller block:• Upon reception of TR = (i, j, k) concerning a memory access for data in

column i, row j and bank k do:• test if transactions Tk1 and Tk2 are enabled• if it is the case, fire either Tk1 or Tk2 according to the preceding

transaction request (TRlast

k) of bank bk:

· if TRlast

k.j = TR.j then fire Tk2

· otherwise fire Tk1


Framework - Software meta-model

Dispatcher (D)

Scheduler (S)

ET

sched

ET:

notify

TR

HTG

NTNTj

NTj:

Enabled tasks

Completed tasks in other components

To a processeur

Software component

• It is composed of:• A hierarchical task graph (HTG)

• Tasks scheduler

• Transaction requests dispatcher



Dispatcher (D)

Scheduler (S)

ET

sched

ET:

notify

TR

HTG

NTNTj

NTj:

Enabled tasks


To a processeur

Software component

• A task is a sequence of transaction requests, example:� x = x+1 � −→

read(@x, 32B);

[...] //increment x

write(@x, 32B);



Dispatcher (D)

Scheduler (S)

ET

sched

ET:

notify

TR

HTG

NTNTj

NTj:

Enabled tasks


To a processeur

Software component

• Tasks behavior is described using FXML andimplemented by the HTG, example:

‖

rrrrrrrrrrrr

LLLLLLLLLLLL

while(true) while(true)

x++;

(i, i)

$$a=x;

FXML

Translated to−−−−−−−−→

‖

mmmmmmmmmmmmmmmm

LLLLLLLLLLLL

while(true) while(true)

S.sched();read(x); [...]; write(x);

S.notify();

S.sched();read(x);[...];S.notify();

HTG


Framework - Tools• Jahuel:

• Describes software and hardware models in FXML

• Synthesizes executable code for software model (eg : C+posix)

• P-Ware:• Takes hardware and software meta-models instances as input from Jahuel

• Provides joint software and hardware performance prediction by simulation




• Related work

• Methodology

• Framework• Video encoding platform application

• Constraints synthesis results

• VE Performance results

• Conclusion


VE platform - Hardware components

CP

U

mem

ory

D−cache

CP

U

mem

ory

D−cache

Controller

Data Bus component

!fire ?end

Arbiter

TR

P1 FIFOP2 FIFO

Buffer

Transactor

TRTR

T10 cycles

TLTB

TR

OK$_{out}$ (state of DRAM)

Output TRto memorybank’s fifos

over Pi FIFORound−robin

software environment

Data busReq bus

CAM HIDISP

OK$_{out}$

TR

DRAM banks

Hardware architecture components


VE platform - Hardware components

CP

U

mem

ory

D−cache

CP

U

mem

ory

D−cache

software environment

Data busReq bus

Data Bus component

CAM HIDISP

Buffer

DRAM banks

Controller

OKout

TR

!fire

?end

Arbiter

TR

P1 FIFO

P2 FIFO

OKout

Transactor

TR

TR

T

10 cycles

TLTB

TR

OKout (state of DRAM)

Output

TR

to

mem

ory

bank’s

fifo

sover Pi FIFORound-robin

Hardware architecture components


VE platform - System constraintsparvar k

CAMddddddddddddddddddddd

ddddddddddddddddddddd HIjjjjjjjjj

jjjjjjjjj DISP P1 . . .PnZZZZZZZZZZZZZZZZZZZZZZ

ZZZZZZZZZZZZZZZZZZZZZZ

for(k)

P=1/15

for(k)

P=1/15

for(k)

P=1/15

par

[mutex]

iiiiiiiiiiiiiiiiiiiiii

UUUUUUUUUUUUUUUUUUUUU

for(k)

P=1/15

for(k)

P=1/15

CAMOUT Mem: CAM buffer

HIIN Mem: CAM bufferOUT Mem: HI buffer

(CAM,k)-> (HI,k)

DISPIN Mem: HI buffer(HI,k) -> (DISP,k)

FCIN Mem: HI buffer

OUT Mem: FC buffer1 & FC buffer2(HI,k) -> (FC,k)

VEIN Mem: FC buffer1 & FC buffer

OUT Mem: VE buffer(FC,k)->(VE,k)

• Taking into account WCET of CAM, HI, and DISP

(δCAM = δHI = δDISP = 1

30) we synthesize:

8

>

>

>

>

>

>

>

<

>

>

>

>

>

>

>

:

bHI = bCAM + 1

30HI activated 1

30after CAM

bDISP = bHI + 1

30DISP activated 1

30after HI

bFC = bHI + 1

30FC activated 1

30after HI

bVE = bFC + 1

30VE activated 1

30after FC

δFC + δVE ≤ 1

15Execution time of VE and FC smaller than 1

15

• With a 2

3× 1

25s format converter (δFC = 2

3× 1

25): δVE ≤

1

25


Implementations of the MPEG-4 VE

MVD : Motion Vector Diff.

Choice

Motion EstimationME :

MVP : Motion Prediction

Encoded picture

Camera

Zigzag

RunLevel

VLC : Variable Length Coding

Layer

ADD

DCTI

QUANTI

reference

Quant : Quantization

DCT : Discrete Cosine Transf.

PRD : Motion Compensation

DIFF : Diff. On Prediction

Extend_rec

reconstructed

Picture

MPEG-4 VE block diagram

• Two Implementations of ENC:

par

CPU1ooo

oooCPU2OOO

OOO

for(k ∈ Gr1) for(k ∈ Gr2)

ENC(1) ENC(2)

HTG with a 2-thread partition of

ENC

par

CPU1ccccccccccccccccccccccc

ccccccccccccccccccccccc CPU2ggggggg

ggggggg CPU3 CPU4XXXXXXXXXX

XXXXXXXXXX

for(k) for(k) for(k) for(k)

MVP, ME, Choice, MVD Prd, DIFF, DCT QUANT, QUANTI DCTI, ADD, extend rec

HTG with a pipeline partition of

ENC


Performance results of VE• Results using P-Ware:

Seq Hybrid (parallel of pipelines)Parallel

cache misses quotient (in %)Required bandwidth (in 1/800 Mb/s)

(a)

(b)

(c) (d) (e) (f) (g)

execution time (in 1/25 s)

1/25 s

0

0.5

1

1.5

2

2.5

3

3.5

4

0 2 4 6 8 10 12 14 16 0

0.5

1

1.5

2

2.5

3

3.5

4

0 2 4 6 8 10 12 14 16 0

0.5

1

1.5

2

2.5

3

3.5

4

0 2 4 6 8 10 12 14 16 0

0.5

1

1.5

2

2.5

3

3.5

4

0 2 4 6 8 10 12 14 16

Execution time, cache misses and used bandwidth for different implementations

Processors number

• (d), (f) and (g) satisfy execution time constraints

• Hybrid ones, i.e. (f) and (g), produce an increase of bandwidth usage

• The best compromise seems to be (d), consisting of 4 MPEG-threads


Conclusion - Framework• Component-based modelling framework combining

transaction-level HW and programmer-level SW models• Joint HW and SW modelling and performance analysis

allowing for predicting:1. Impact of HW on SW performance

2. Ability of HW to accommodate future services

• A programming and simulation tools supporting theframework

1. Jahuel

2. P-Ware


Conclusion - Applications• Application to real-life industrial systems:

• MPEG-4, IPv4, Philips WASABI NoC, and Intel’s IXP2800

• Models expressiveness:• Data granularity is easily set-up using the imlementation of software

dispatchers and/or hardware transactors: bus-packet (eg. a line of pixels)

suited for MEPG-4, and bus-size for IPv4.

• Tool performance:• Scalable prototype: eg. dual IXP NP with 768 memory bank component

• Fast simulation speeds: average 300 000 cycles/s

• Models precision:• Precise performance results: eg. values are within 5% of the ones obtained

by the IXP2800 cycle-accurate simulator for several data granularities

• Correct performance trends

• Joint software and hardware modelling environment:• Automatization using Jahuel

• Fast and efficient system design with user-driven joint software and

hardware performance tracking using P-WareARTIST WS 2007 – Modelling and Performance Analysis – p.21/22

Thank you!


Documents

Methodology and Tools for Performance Analysis of ...Methodology and Tools for Performance Analysis of Multiprocessor Embedded Systems Ismail Assayad and Sergio Yovine Verimag, France