Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
Methodology and Tools for Performance Analysis of
Multiprocessor Embedded Systems
Ismail Assayad and Sergio Yovine
Verimag, France
ARTIST WS 2007, Berlin
ARTIST WS 2007 – Modelling and Performance Analysis – p.1/22
Presentation plan• Motivation
• Current practices
• Related work
• Methodology
• Framework
• Applications• Video encoding platform application
• Conclusion
ARTIST WS 2007 – Modelling and Performance Analysis – p.2/22
Motivation
• Many embedded applications such as video compression,HDTV and packet routing require higher and higherperformance =⇒
1. Hardware becomes multiprocessor
2. Software becomes parallel
• Significant growth in the demand and workload ofembedded architectures =⇒
1. Need to be able to predict the mutual impact of software and hardware on
their performance
2. Framework supporting joint (rather than separate) software and hardware
analysis
ARTIST WS 2007 – Modelling and Performance Analysis – p.3/22
Current practices• Modelling approaches:
• Analytical models
• Simulation• Wrapper-based external timing models• Annotated timing models
• Modelling scope:• Software-based
• Analysis of hardware performance is not considered
• Hardware-based• Focuses on hardware design without taking into account software
development or analysis
• Platform-based design• Abstraction levels for modelling both software and hardware
architectures
ARTIST WS 2007 – Modelling and Performance Analysis – p.4/22
Related work• Analytical (Thiele et al):
• Specific to the application domain of packet processing
• Network calculus theory for reasoning about interleaved streams of packets
• MPARM (Benini et al)• Wrapper-based
• Multiprocessor simulation platform for analysis of hardware design tradeoffs
• ARTS (Madsen et al):• Annotated DAG for software and communication latencies for hardware
• Software timing models are not resolved to micro-architectures ones
• Metropolis (Vincentelli et al), MESH (Paul et al):• General-purpose approaches for concurrency modelling at unfixed
abstraction level
• SystemC/TLM (Ghenassia et al):• Industrial practices use wrapper-based models (PV, PVT) but does not
propose a method
ARTIST WS 2007 – Modelling and Performance Analysis – p.5/22
Our approach• Our approach is a simulation and platform based one, it
provides:• Methodology for concurrency and performance modelling of
micro-architectures
• Modelling hardware at a transaction level and software at a task level
• Wrapper-less annotated-method based timed models for components
• Advantages:• Semantics and methodology for components construction, connexion, and
performance prediction
• Joint software and hardware model-based performance analysis support
• An implementation of our framework is built usingSystemC and TLM
ARTIST WS 2007 – Modelling and Performance Analysis – p.6/22
Methodology
SW Model
SW Model
SW Model
HW Model
Mapping
Translation
Simulation
Constraints Synthesis
Code Generation
SW components HW components
Simulation
Engine
user
user
Executable
ARTIST WS 2007 – Modelling and Performance Analysis – p.7/22
Presentation plan• Motivation
• Current practices
• Related work
• Methodology
• Framework• Modelling• Hardware meta-models• Example• Software meta-models• Tools
• Application
• Conclusion
ARTIST WS 2007 – Modelling and Performance Analysis – p.8/22
Framework - Modelling• Components
• They model transactions behavior of the system
• They communicate through transaction requests and state-change based
events
• They support profiling of predicted performance (eg. used bandwidth,
conflicts, execution and communication times, etc)
4
5
CHANNEL
3
1 ME
MO
RY
2
DATA BUS
writ
e da
ta
write cmdwrite data
CMD BUS
writ
e da
ta
write cmd
write(@,32B,data)
Consumer Producer
P2
P1
notify
write cmd
(HW components)
(SW components)
ARTIST WS 2007 – Modelling and Performance Analysis – p.9/22
Framework - Hardware meta-models• Component meta-models
• can be instantiated for modelling hardware micro-architectures at
transaction-level
• are performance-centric and take into account arbitration policies,
transaction-level latencies and generated transaction request traffic
• are composed of following blocks:
Controller
TLTB
Tra
nsac
tor
Arbiter
TR
TR
Buffer
OK
OK
Buffered component
ARTIST WS 2007 – Modelling and Performance Analysis – p.10/22
Framework - Hardware meta-models• Component behavior
• Blocks are described by automata whose states are instantiated with
hardware specific functions
• Examples: some variables and functions associated to “Running”and
“Executing” states of controller and transaction automata are hardware
dependent
ReadyStalled
Running
[Not
T i.E
nabl
ed]
[Ti.
Ena
bled
]!Ti .F
ire
?req
Controller automaton
ARTIST WS 2007 – Modelling and Performance Analysis – p.11/22
Framework - DRAM example
conditional
operation133
1
133
1ba
nk 2
bank
1
T21
conflict-group
C
CB
BA
A
C
CT22
T11
T12
TLTB block model of a two-bank DRAM (instance of the meta-model)
• In the“Running”state of the controller block:• Upon reception of TR = (i, j, k) concerning a memory access for data in
column i, row j and bank k do:• test if transactions Tk1 and Tk2 are enabled• if it is the case, fire either Tk1 or Tk2 according to the preceding
transaction request (TRlast
k) of bank bk:
· if TRlast
k.j = TR.j then fire Tk2
· otherwise fire Tk1
ARTIST WS 2007 – Modelling and Performance Analysis – p.12/22
Framework - Software meta-model
Dispatcher (D)
Scheduler (S)
ET
sched
ET:
notify
TR
HTG
NTNTj
NTj:
Enabled tasks
Completed tasks in other components
To a processeur
Software component
• It is composed of:• A hierarchical task graph (HTG)
• Tasks scheduler
• Transaction requests dispatcher
ARTIST WS 2007 – Modelling and Performance Analysis – p.13/22
Framework - Software meta-model
Dispatcher (D)
Scheduler (S)
ET
sched
ET:
notify
TR
HTG
NTNTj
NTj:
Enabled tasks
Completed tasks in other components
To a processeur
Software component
• A task is a sequence of transaction requests, example:� x = x+1 � −→
read(@x, 32B);
[...] //increment x
write(@x, 32B);
ARTIST WS 2007 – Modelling and Performance Analysis – p.13/22
Framework - Software meta-model
Dispatcher (D)
Scheduler (S)
ET
sched
ET:
notify
TR
HTG
NTNTj
NTj:
Enabled tasks
Completed tasks in other components
To a processeur
Software component
• Tasks behavior is described using FXML andimplemented by the HTG, example:
‖
rrrrrrrrrrrr
LLLLLLLLLLLL
while(true) while(true)
x++;
(i, i)
$$a=x;
FXML
Translated to−−−−−−−−→
‖
mmmmmmmmmmmmmmmm
LLLLLLLLLLLL
while(true) while(true)
S.sched();read(x); [...]; write(x);
S.notify();
S.sched();read(x);[...];S.notify();
HTG
ARTIST WS 2007 – Modelling and Performance Analysis – p.13/22
Framework - Tools• Jahuel:
• Describes software and hardware models in FXML
• Synthesizes executable code for software model (eg : C+posix)
• P-Ware:• Takes hardware and software meta-models instances as input from Jahuel
• Provides joint software and hardware performance prediction by simulation
ARTIST WS 2007 – Modelling and Performance Analysis – p.14/22
Presentation plan• Motivation
• Current practices
• Related work
• Methodology
• Framework• Video encoding platform application
• Constraints synthesis results
• VE Performance results
• Conclusion
ARTIST WS 2007 – Modelling and Performance Analysis – p.15/22
VE platform - Hardware components
CP
U
mem
ory
D−cache
CP
U
mem
ory
D−cache
Controller
Data Bus component
!fire ?end
Arbiter
TR
P1 FIFOP2 FIFO
Buffer
Transactor
TRTR
T10 cycles
TLTB
TR
OK$_{out}$ (state of DRAM)
Output TRto memorybank’s fifos
over Pi FIFORound−robin
software environment
Data busReq bus
CAM HIDISP
OK$_{out}$
TR
DRAM banks
Hardware architecture components
ARTIST WS 2007 – Modelling and Performance Analysis – p.16/22
VE platform - Hardware components
CP
U
mem
ory
D−cache
CP
U
mem
ory
D−cache
software environment
Data busReq bus
Data Bus component
CAM HIDISP
Buffer
DRAM banks
Controller
OKout
TR
!fire
?end
Arbiter
TR
P1 FIFO
P2 FIFO
OKout
Transactor
TR
TR
T
10 cycles
TLTB
TR
OKout (state of DRAM)
Output
TR
to
mem
ory
bank’s
fifo
sover Pi FIFORound-robin
Hardware architecture components
ARTIST WS 2007 – Modelling and Performance Analysis – p.16/22
VE platform - System constraintsparvar k
CAMddddddddddddddddddddd
ddddddddddddddddddddd HIjjjjjjjjj
jjjjjjjjj DISP P1 . . .PnZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZ
for(k)
P=1/15
for(k)
P=1/15
for(k)
P=1/15
par
[mutex]
iiiiiiiiiiiiiiiiiiiiii
UUUUUUUUUUUUUUUUUUUUU
for(k)
P=1/15
for(k)
P=1/15
CAMOUT Mem: CAM buffer
HIIN Mem: CAM bufferOUT Mem: HI buffer
(CAM,k)-> (HI,k)
DISPIN Mem: HI buffer(HI,k) -> (DISP,k)
FCIN Mem: HI buffer
OUT Mem: FC buffer1 & FC buffer2(HI,k) -> (FC,k)
VEIN Mem: FC buffer1 & FC buffer
OUT Mem: VE buffer(FC,k)->(VE,k)
• Taking into account WCET of CAM, HI, and DISP
(δCAM = δHI = δDISP = 1
30) we synthesize:
8
>
>
>
>
>
>
>
<
>
>
>
>
>
>
>
:
bHI = bCAM + 1
30HI activated 1
30after CAM
bDISP = bHI + 1
30DISP activated 1
30after HI
bFC = bHI + 1
30FC activated 1
30after HI
bVE = bFC + 1
30VE activated 1
30after FC
δFC + δVE ≤ 1
15Execution time of VE and FC smaller than 1
15
• With a 2
3× 1
25s format converter (δFC = 2
3× 1
25): δVE ≤
1
25
ARTIST WS 2007 – Modelling and Performance Analysis – p.17/22
Implementations of the MPEG-4 VE
MVD : Motion Vector Diff.
Choice
Motion EstimationME :
MVP : Motion Prediction
Encoded picture
Camera
Zigzag
RunLevel
VLC : Variable Length Coding
Layer
ADD
DCTI
QUANTI
reference
Quant : Quantization
DCT : Discrete Cosine Transf.
PRD : Motion Compensation
DIFF : Diff. On Prediction
Extend_rec
reconstructed
Picture
MPEG-4 VE block diagram
• Two Implementations of ENC:
par
CPU1ooo
oooCPU2OOO
OOO
for(k ∈ Gr1) for(k ∈ Gr2)
ENC(1) ENC(2)
HTG with a 2-thread partition of
ENC
par
CPU1ccccccccccccccccccccccc
ccccccccccccccccccccccc CPU2ggggggg
ggggggg CPU3 CPU4XXXXXXXXXX
XXXXXXXXXX
for(k) for(k) for(k) for(k)
MVP, ME, Choice, MVD Prd, DIFF, DCT QUANT, QUANTI DCTI, ADD, extend rec
HTG with a pipeline partition of
ENC
ARTIST WS 2007 – Modelling and Performance Analysis – p.18/22
Performance results of VE• Results using P-Ware:
Seq Hybrid (parallel of pipelines)Parallel
cache misses quotient (in %)Required bandwidth (in 1/800 Mb/s)
(a)
(b)
(c) (d) (e) (f) (g)
execution time (in 1/25 s)
1/25 s
0
0.5
1
1.5
2
2.5
3
3.5
4
0 2 4 6 8 10 12 14 16 0
0.5
1
1.5
2
2.5
3
3.5
4
0 2 4 6 8 10 12 14 16 0
0.5
1
1.5
2
2.5
3
3.5
4
0 2 4 6 8 10 12 14 16 0
0.5
1
1.5
2
2.5
3
3.5
4
0 2 4 6 8 10 12 14 16
Execution time, cache misses and used bandwidth for different implementations
Processors number
• (d), (f) and (g) satisfy execution time constraints
• Hybrid ones, i.e. (f) and (g), produce an increase of bandwidth usage
• The best compromise seems to be (d), consisting of 4 MPEG-threads
ARTIST WS 2007 – Modelling and Performance Analysis – p.19/22
Conclusion - Framework• Component-based modelling framework combining
transaction-level HW and programmer-level SW models• Joint HW and SW modelling and performance analysis
allowing for predicting:1. Impact of HW on SW performance
2. Ability of HW to accommodate future services
• A programming and simulation tools supporting theframework
1. Jahuel
2. P-Ware
ARTIST WS 2007 – Modelling and Performance Analysis – p.20/22
Conclusion - Applications• Application to real-life industrial systems:
• MPEG-4, IPv4, Philips WASABI NoC, and Intel’s IXP2800
• Models expressiveness:• Data granularity is easily set-up using the imlementation of software
dispatchers and/or hardware transactors: bus-packet (eg. a line of pixels)
suited for MEPG-4, and bus-size for IPv4.
• Tool performance:• Scalable prototype: eg. dual IXP NP with 768 memory bank component
• Fast simulation speeds: average 300 000 cycles/s
• Models precision:• Precise performance results: eg. values are within 5% of the ones obtained
by the IXP2800 cycle-accurate simulator for several data granularities
• Correct performance trends
• Joint software and hardware modelling environment:• Automatization using Jahuel
• Fast and efficient system design with user-driven joint software and
hardware performance tracking using P-WareARTIST WS 2007 – Modelling and Performance Analysis – p.21/22
Thank you!
ARTIST WS 2007 – Modelling and Performance Analysis – p.22/22