31
Architecture-Parametric Timing Analysis Jan Reineke Johannes Doerfert 20th IEEE Real-Time and Embedded Technology and Applications Symposium April 15-17, 2014 Berlin, Germany computer science saarland university

Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

A r c h i t e c t u r e - P a r a m e t r i c T i m i n g A n a l y s i s

J a n R e i n e k e !J o h a n n e s D o e r f e r t

20th IEEE Real-Time and Embedded Technology and Applications Symposium April 15-17, 2014 Berlin, Germany

computer science

saarlanduniversity

Page 2: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

A r c h i t e c t u r e - C o n f i g u r a t i o n C h a l l e n g e : A t D e s i g n T i m e

2

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Page 3: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

A r c h i t e c t u r e - C o n f i g u r a t i o n C h a l l e n g e : A t D e s i g n T i m e

2

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Maximal Frequency

Page 4: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

A r c h i t e c t u r e - C o n f i g u r a t i o n C h a l l e n g e : A t D e s i g n T i m e

2

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Maximal Frequency

Size

Size

Page 5: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

A r c h i t e c t u r e - C o n f i g u r a t i o n C h a l l e n g e : A t D e s i g n T i m e

2

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Maximal Frequency

Latency, Bandwidth

Size

Size

Page 6: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

A r c h i t e c t u r e - C o n f i g u r a t i o n C h a l l e n g e : A t R u n t i m e

3

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Dynamic Frequency/

Voltage Scaling

TDMA Configuration

Partitioning, Turn off?

Page 7: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

A r c h i t e c t u r e - C o n f i g u r a t i o n C h a l l e n g e : M a n y - C o r e

4

Network on Chip

Main Memory

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared CacheConfiguration

Latency, Bandwidth

Page 8: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

A r c h i t e c t u r e - C o n f i g u r a t i o n C h a l l e n g e : M a n y - C o r e

4

Network on Chip

Main Memory

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared CacheConfiguration

Latency, Bandwidth

Configuration affects implementation cost, energy consumption, and worst-case execution times!

Page 9: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

A r c h i t e c t u r e - P a r a m e t r i c T i m i n g A n a l y s i s

5

Network on Chip

Main Memory

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Embedded Software

Configurable Platform

+

Parametric WCET

?

Page 10: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

A r c h i t e c t u r e - P a r a m e t r i c T i m i n g A n a l y s i s

5

Network on Chip

Main Memory

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Core

Local Cache

Bus

Shared Cache

Embedded Software

Configurable Platform

+

Parametric WCET

?

Desiderata: • Precise • Efficiently evaluable

Page 11: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

A r c h i t e c t u r e - P a r a m e t r i c T i m i n g A n a l y s i s : “ B l a c k - B o x ” A p p r o a c h

6

Conventional non-parametric timing analysis

Configuration Configuration

WCET

Black-Box WCET

Analysis

Software Binary

Page 12: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

Configuration 1

Configuration 2

Configuration 3

Configuration 4

Configuration 5

Configuration 1

WCET 1

Black-Box WCET

Analysis

Generalize from

Examples

Parametric WCET

Software Binary

Configuration 2

WCET 2Configuration

3WCET 3

Configuration 4

WCET 4Configuration

5WCET 5

A r c h i t e c t u r e - P a r a m e t r i c T i m i n g A n a l y s i s : “ B l a c k - B o x ” A p p r o a c h

7

Page 13: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

Configuration 1

Configuration 2

Configuration 3

Configuration 4

Configuration 5

Configuration 1

WCET 1

Black-Box WCET

Analysis

Generalize from

Examples

Parametric WCET

Software Binary

Configuration 2

WCET 2Configuration

3WCET 3

Configuration 4

WCET 4Configuration

5WCET 5

A r c h i t e c t u r e - P a r a m e t r i c T i m i n g A n a l y s i s : “ B l a c k - B o x ” A p p r o a c h

7

1. How to generalize from examples?

Page 14: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

Configuration 1

Configuration 2

Configuration 3

Configuration 4

Configuration 5

Configuration 1

WCET 1

Black-Box WCET

Analysis

Generalize from

Examples

Parametric WCET

Software Binary

Configuration 2

WCET 2Configuration

3WCET 3

Configuration 4

WCET 4Configuration

5WCET 5

A r c h i t e c t u r e - P a r a m e t r i c T i m i n g A n a l y s i s : “ B l a c k - B o x ” A p p r o a c h

7

1. How to generalize from examples?

2. Where to sample black box?

Page 15: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

R e q u i r e m e n t s f o r S o u n d a n d E f f i c i e n t G e n e r a l i z a t i o n

8

Necessary: Execution times should be monotone in parameters: “higher frequencies yield shorter execution times” “smaller caches yield longer execution times”

Page 16: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

Desirable for efficiency: Execution time should depend linearly on parameters: “doubling the processor frequency will decrease execution time by a factor of two”

R e q u i r e m e n t s f o r S o u n d a n d E f f i c i e n t G e n e r a l i z a t i o n

9

Page 17: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

|cache|=0c=1m=0 WCET=13

|cache|=0c=0m=1 WCET=5

|cache|=4096c=1m=0 WCET=10

|cache|=4096c=0m=1 WCET=3

|cache| >= 4096

WCET= 10*c+3*m

WCET= 13*c+5*m

Yes NoGeneralize

from Examples

1 . H o w t o G e n e r a l i z e f r o m E x a m p l e s ? R e d u c t i o n t o P a r a m e t r i c L i n e a r P r o g r a m m i n g

10

Formulate a series of parametric linear programs, encoding: • Configurations/WCETs obtained from Black Box • Properties that allow to generalize

Page 18: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

|cache|=0c=1m=0 WCET=13

|cache|=0c=0m=1 WCET=5

|cache|=4096c=1m=0 WCET=10

|cache|=4096c=0m=1 WCET=3

|cache| >= 4096

WCET= 10*c+3*m

WCET= 13*c+5*m

Yes NoGeneralize

from Examples

1 . H o w t o G e n e r a l i z e f r o m E x a m p l e s ? R e d u c t i o n t o P a r a m e t r i c L i n e a r P r o g r a m m i n g

10

Formulate a series of parametric linear programs, encoding: • Configurations/WCETs obtained from Black Box • Properties that allow to generalize

See paper for details!

Page 19: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

Configuration 1

Configuration 2

Configuration 3

Configuration 4

Configuration 5

Configuration 1

WCET 1

Black-Box WCET

Analysis

Generalize from

Examples

Parametric WCET

Software Binary

Configuration 2

WCET 2Configuration

3WCET 3

Configuration 4

WCET 4Configuration

5WCET 5

“ B l a c k - B o x ” A p p r o a c h

11

2. Where to sample black box?

Page 20: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

2 . W h e r e t o s a m p l e t h e b l a c k b o x ?

12

Wanted: Small set of configurations that yields precise parametric WCET

fast analysis

= “close” to black-box

“everywhere”

Page 21: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

2 . W h e r e t o s a m p l e t h e b l a c k b o x ? I n c r e m e n t a l S a m p l i n g

13

Software Binary

Generalize from

Examples

Black-Box WCET

Analysis

Parametric WCET

Parametric Lower Bound

Difference less than epsilon?

Return Parametric

WCET Bound

Violating Configuration

Configuration 1 WCET 1

Configuration 2 WCET 2

Configuration 3 WCET 3

Configuration 4 WCET 4

Configuration 5 WCET 5

Page 22: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

2 . W h e r e t o s a m p l e t h e b l a c k b o x ? I n c r e m e n t a l S a m p l i n g

13

Theorem: Algorithm terminates.

Software Binary

Generalize from

Examples

Black-Box WCET

Analysis

Parametric WCET

Parametric Lower Bound

Difference less than epsilon?

Return Parametric

WCET Bound

Violating Configuration

Configuration 1 WCET 1

Configuration 2 WCET 2

Configuration 3 WCET 3

Configuration 4 WCET 4

Configuration 5 WCET 5

Page 23: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

Ta r g e t f o r P r o t o t y p e : A P a r a m e t e r i z e d P r e c i s i o n - T i m e d A r c h i t e c t u r e

14

Parameterized version of the PTARM, a predictable microarchitecture developed within the PRET project. !6 parameters that control • latencies of arithmetic and branch instructions, • latencies of loads and stores to the scratchpads and to DRAM, • sizes of instruction and data scratchpads. !

Page 24: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

Ta r g e t f o r P r o t o t y p e : A P a r a m e t e r i z e d P r e c i s i o n - T i m e d A r c h i t e c t u r e

14

Parameterized version of the PTARM, a predictable microarchitecture developed within the PRET project. !6 parameters that control • latencies of arithmetic and branch instructions, • latencies of loads and stores to the scratchpads and to DRAM, • sizes of instruction and data scratchpads. !For experimental evaluation: • Black-box WCET analysis based on OTAWA • Parameterized PTARM simulator

Page 25: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

E x p e r i m e n t a l E v a l u a t i o n : P r e c i s i o n o f B l a c k B o x

15

Mälardalen benchmarks minus floating-point, recursion, complex switch statements

Black Box/Simulator

TABLE II. BRIEF SUMMARY OF THE BENCHMARKS.Name Size [byte] Brief descriptionadpcm 26852 Adaptive pulse code modulation algorithm.bs 4248 Binary search for the array of 15 integer elements.bsort100 2779 Bubblesort program.crc 5168 Cyclic redundancy check computation on 40 bytes of data.fdct 8863 Fast Discrete Cosine Transform.fibcall 3499 Simple iterative Fibonacci calculation, to calculate fib(30).insertsort 3892 Insertion sort on a reversed array of size 10.janne complex 1564 Nested loop program.jfdctint 16028 Discrete-cosine transformation on a 8x8 pixel block.matmult 3737 Matrix multiplication of two 20x20 matrices.ns 10436 Search in a multi-dimensional array.nsichneu 11835 Simulate an extended Petri Net.qsort-exam 4535 Non-recursive version of quick sort algorithm.statemate 52618 Automatically generated code.

GNU ARM toolchain including GCC version 4.3.2. Timemeasurements were performed on an INTEL CORE I7 920running at 2.67 GHz with 12 GB of RAM. We chose 64 KBas the maximal value for both the instruction and the datascratchpad memories. Linear parameters, modeling latencies,may take any rational value between 0 and 10.

B. Evaluation Results

Our first evaluation goal is to confirm that the black-box WCET analysis over-approximates the timing of theparameterized PTARM, and to evaluate its precision. To thisend, we determine for each benchmark the ratio between theblack-box WCET estimate and the execution time determinedusing the PTARM simulator in a single simulation run with theinputs that are provided with the MALARDALEN benchmarks.As the value analysis in the black box is very simple and thusbound to be imprecise, we perform this comparison with alllinear parameters, including the DRAM latencies, set to 1. Thiseliminates the influence of the value analysis from the results.The results of this analysis are illustrated in Table III. For somebenchmarks the black-box estimate is very close to the simula-tion result, yet for others the ratio is extremely large. This is dueto imprecise loop bounds and other constraints on the controlflow, and to the fact that the input exercised during simulationdoes not represent the worst-case input, e.g., in the sorting tasks.

Next, we evaluate how the number of black-box WCETsamples affects the precision of the parametric analysis resultson unsampled parameter vectors. To this end, we modifyAlgorithm 3 to report upper and lower bounds whenever a newsample has been taken. This yields two ASTs, �

P,i

and �

P,i

,corresponding to the lower and upper bounds on the black box,for each benchmark P in the set of benchmarks P and numberof samples i. Then, we sample the parameter space uniformly atrandom 100 times. For each of the randomly drawn parametervaluations (�

j

j

), we evaluate the black box BBP

, and theupper and lower bounds �

P,i

,�

P,i

, and determine their ratios:

r

overP,i,j

:=

J�P,i

K(�j

j

)

BBP

(�

j

j

)

and r

underP,i,j

:=

J�P,i

K(�j

j

)

BBP

(�

j

j

)

.

We summarize these ratios by taking their geometric meansr

overi

, r

underi

over all benchmarks found in Table III. InFigure 3, we depict r

overi

and r

underi

for i between two2 and26. The experiment was performed with a precision targetof (✏ = 1024, ⌧ = (0, 0))

3, which ensures that an arbitrarynumber of samples can be taken. We observe a strong precision

2We start at two samples, because Algorithm 3 performs two samples beforeentering the refinement loop.

3Where ⌧ refers to the monotone parameters µI-SPM-Size and µD-SPM-Size.

TABLE III. PRECISION OF THE BLACK-BOX WCET ANALYSIS.

Name Black Box (cycles) Simulator (cycles) Ratioadpcm 9989637 1598152 6.25bs 318 279 1.14bsort100 998109 8293 120.36crc 248231 116995 2.12fdct 11262 11069 1.02fibcall 1140 1131 1.01insertsort 4965 2949 1.68janne complex 4048 753 5.38jfdctint 14016 13951 1.00matmult 755274 745669 1.01ns 42550 42549 1.00nsichneu 32339 15551 2.08qsort-exam 2132100 11125 191.65statemate 108766 2809 38.72

2 3 4 5 6 7 8 9 10111213141516171819202122232425260

1

2

Number of samples

Rat

io

roveri

runderi

roveri,random

Fig. 3. Ratios between upper bound and black box roveri , rover

i,random and ratiobetween lower bound and black box runder

i in terms of the number of samples i.

improvement on samples 3 to 7. After the first 7 samples, theobtained lower bound is very close to the actual black-boxvalues for all benchmarks. The upper bound comes within 5%

of the black box after 16 samples for most benchmarks, whichis reflected by the geometric mean in the figure. For mostbenchmarks, the algorithm chooses to sample the black box ata different instruction scratchpad size than before, at samples12 and 16, which yields a significant precision improvement.

As a baseline for Algorithm 3, we determine how well upperbounds based on a random set of samples approximate the blackbox. In Figure 3, rover

i,random denotes the ratio between these upperbounds, based on i random samples, and the black box. Randomsampling yields much less precise estimates. In addition,computation times increase dramatically, which explains whywe only perform this experiment for up to 15 samples. Thisdemonstrates that some form of “intelligent” sampling isrequired for the black-box approach to be precise and efficient.

To evaluate analysis efficiency, we determined the analysistime of each benchmark up to and including the i

th sample.We decompose this analysis time into three components:

1) Invocations of the black box.2) Operations on affine selection trees, e.g. minimization.3) Invocations of PIPLIB.

In Figure 4 we show the geometric mean of the analysis timeover all benchmarks up to the i

th sample. The bars are stacked,meaning that the top of the upper most bar reflects the overallanalysis time. As the black box is called once for each sample,its contribution to the overall analysis time grows linearly. Asuperlinear growth is observed for the other two components,which is expected, as the problems to be solved grow witheach sample. PIPLIB’s contribution grows strongest, movingfrom the smallest to the largest share of analysis time. Asobserved in Figure 3, 16 samples usually result in very preciseparametric upper bounds. On our benchmarks, 16 samples areprocessed in less than 2.2 seconds on the average.

Page 26: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

E x p e r i m e n t a l E v a l u a t i o n : P r e c i s i o n i n Te r m s o f N u m b e r o f S a m p l e s

16

2 3 4 5 6 7 8 9 10111213141516171819202122232425260

1

2

Number of samples

Rat

io

Fig. 4. Ratios between upper bound and black box roveri , rover

i,random and ratiobetween lower bound and black box runder

i in terms of the number of samples i.

10 20 30 400

10,000

20,000

Number of samples

Ana

lysi

sTi

me

(inm

s)

PIPLIB

AST OperationsBlack Box

Fig. 5. Analysis time in terms of number of samples.

observed in Figure 4, 16 samples usually result in very preciseparametric upper bounds. On our benchmarks, 16 samples areprocessed in less than 2.2 seconds on the average.

In Figure 4 we have shown how the actual precisiondepends on the number of samples. Next, we determinehow many samples are required to meet a certain precisionguarantee. To this end, we determine for each benchmarkand for several precision targets (✏, (⌧, ⌧))

4 the number ofsamples s

P

(✏, ⌧) that Algorithm 3 takes until termination.We summarize these values in Figure 6. Three benchmarks,nsichneu, adpcm, statemate, ran out of memory forsmaller values of ⌧ . For the tightest precision requirement,✏ = 32, we report the maximum required number of sampless

max32

(⌧) = max

P2P s

P

(32, ⌧) over all benchmarks for whichthe analysis terminated successfully, for ⌧ between 8192 and262144. For the loosest precision requirement, ✏ = 1024, onthe other hand, we report the minimum required number ofsamples s

min1024

(⌧) over all benchmarks. For ✏ between 32 and1024, s

P

(✏, ⌧) is expected to lie between s

min1024

(⌧) and s

max32

(⌧).We also report the median number of samples s

median256

(⌧) overall benchmarks for ✏ = 256. For most benchmarks, the numberof required samples is quite insensitive to ✏: the median for✏ = 256 is close to the minimum for ✏ = 1024.

VIII. APPLICATION TO COMMERCIALMICROARCHITECTURES

We have applied APTA to a parameterized version ofthe PTARM, a precision-timed architecture. It has beendesigned with the specific goal of reconciling performanceand predictability. And indeed, as we demonstrate, precise andefficient architecture-parametric WCET analysis is feasible forthe PTARM. Naturally, the question arises whether our approachis also applicable to other existing academic and commercial

4Where (⌧, ⌧) refers to the monotone parameters µI-SPM-Size and µD-SPM-Size.

262144 131072 65536 32768 16384 81920

50

100

excl. nsichneu

excl. adpcm, statemate

Precision Requirement ⌧

Num

ber

ofsa

mpl

es

Fig. 6. Number of samples required to reach precision guarantee (✏, (⌧, ⌧)).Some benchmarks, namely nsichneu, adpcm, statemate, ran out ofmemory for ✏ = 32 and are thus not included in smax

32 (⌧). The median andminimum, smedian

256 (⌧) and smin1024(⌧), could be determined for all values of ⌧ .

microarchitectures and whether a similar parameterization isreasonable for such microarchitectures.

To answer the second question first, we believe that the pa-rameterization of the PTARM is quite typical for both commer-cial and research platforms: aspects that are often configurablein single-core processors are the processor frequency, the sizesof local memories (caches or scratchpads), and the interconnect,affecting memory access latencies, corresponding directly to theparameters in the PTARM. To use multi-core architectures in ahard real-time context, most of their shared resources will haveto be partitioned (an approach promoted among others in theMERASA and Predator projects [15], [3]) in space and/or time:caches can be partitioned along their ways [16], buses and otherinterconnect can be partitioned in time by time division multipleaccess (TDMA) arbitration [17], and access to DRAM memorycan be partitioned in time [18] and space [10]. Partitioningin space intuitively induces monotone parameters, whereastime-based partitioning may sometimes be modeled with linearparameters, however, caveats exist, as discussed below.

For our approach to be applicable to a particular platform,its parameterized timing model needs to be linear or at leastmonotone in all of its parameters. This is, unfortunately, not thecase for canonical models of many complex microarchitectures.In contrast to LRU, FIFO cache replacement is known tosuffer from Belady’s anomaly, i.e., under FIFO, increasingthe cache’s size may lead to a decrease in performance. Forplatforms including such non-monotone features, monotoneparameterized timing models can be developed, however, atthe cost of a loss in precision. Alternatively, if the goal is tofind a system configuration statically, our analysis may alsobe performed on timing models that are not guaranteed to bemonotone. This yields parametric WCET estimations that arenot necessarily safe. Once a system configuration has beendetermined based on such an estimation, the black box canstill be used to verify whether the timing constraints can bemet. If not, the parametric WCET estimation can be refinedaccordingly and the process would have to be iterated.

In addition to monotonicity, our approach requires timing tobe decomposed into contributions that can be attributed to differ-ent components. Such a decomposition is natural for so-calledfully timing-compositional architectures [3], [19]. An exampleof a fully timing-compositional commercial architecture is theARM7 [3]. For more complex architectures such as the InfineonTriCore or the PowerPC 755, it is possible to create conservativecompositional models. The resulting loss in precision may, how-ever, be substantial and depends on the particular architectureand decomposition [19], and is a topic of ongoing research.

Parametric WCET/Black Box

Parametric Lower Bound/Black Box

Geometric Mean

Page 27: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

TABLE II. BRIEF SUMMARY OF THE BENCHMARKS.Name Size [byte] Brief descriptionadpcm 26852 Adaptive pulse code modulation algorithm.bs 4248 Binary search for the array of 15 integer elements.bsort100 2779 Bubblesort program.crc 5168 Cyclic redundancy check computation on 40 bytes of data.fdct 8863 Fast Discrete Cosine Transform.fibcall 3499 Simple iterative Fibonacci calculation, to calculate fib(30).insertsort 3892 Insertion sort on a reversed array of size 10.janne complex 1564 Nested loop program.jfdctint 16028 Discrete-cosine transformation on a 8x8 pixel block.matmult 3737 Matrix multiplication of two 20x20 matrices.ns 10436 Search in a multi-dimensional array.nsichneu 11835 Simulate an extended Petri Net.qsort-exam 4535 Non-recursive version of quick sort algorithm.statemate 52618 Automatically generated code.

GNU ARM toolchain including GCC version 4.3.2. Timemeasurements were performed on an INTEL CORE I7 920running at 2.67 GHz with 12 GB of RAM. We chose 64 KBas the maximal value for both the instruction and the datascratchpad memories. Linear parameters, modeling latencies,may take any rational value between 0 and 10.

B. Evaluation Results

Our first evaluation goal is to confirm that the black-box WCET analysis over-approximates the timing of theparameterized PTARM, and to evaluate its precision. To thisend, we determine for each benchmark the ratio between theblack-box WCET estimate and the execution time determinedusing the PTARM simulator in a single simulation run with theinputs that are provided with the MALARDALEN benchmarks.As the value analysis in the black box is very simple and thusbound to be imprecise, we perform this comparison with alllinear parameters, including the DRAM latencies, set to 1. Thiseliminates the influence of the value analysis from the results.The results of this analysis are illustrated in Table III. For somebenchmarks the black-box estimate is very close to the simula-tion result, yet for others the ratio is extremely large. This is dueto imprecise loop bounds and other constraints on the controlflow, and to the fact that the input exercised during simulationdoes not represent the worst-case input, e.g., in the sorting tasks.

Next, we evaluate how the number of black-box WCETsamples affects the precision of the parametric analysis resultson unsampled parameter vectors. To this end, we modifyAlgorithm 3 to report upper and lower bounds whenever a newsample has been taken. This yields two ASTs, �

P,i

and �

P,i

,corresponding to the lower and upper bounds on the black box,for each benchmark P in the set of benchmarks P and numberof samples i. Then, we sample the parameter space uniformly atrandom 100 times. For each of the randomly drawn parametervaluations (�

j

j

), we evaluate the black box BBP

, and theupper and lower bounds �

P,i

,�

P,i

, and determine their ratios:

r

overP,i,j

:=

J�P,i

K(�j

j

)

BBP

(�

j

j

)

and r

underP,i,j

:=

J�P,i

K(�j

j

)

BBP

(�

j

j

)

.

We summarize these ratios by taking their geometric meansr

overi

, r

underi

over all benchmarks found in Table III. InFigure 4, we depict r

overi

and r

underi

for i between two2 and26. The experiment was performed with a precision targetof (✏ = 1024, ⌧ = (0, 0))

3, which ensures that an arbitrarynumber of samples can be taken. We observe a strong precision

2We start at two samples, because Algorithm 3 performs two samples beforeentering the refinement loop.

3Where ⌧ refers to the monotone parameters µI-SPM-Size and µD-SPM-Size.

TABLE III. PRECISION OF THE BLACK-BOX WCET ANALYSIS.

Name Black Box (cycles) Simulator (cycles) Ratioadpcm 9989637 1598152 6.25bs 318 279 1.14bsort100 998109 8293 120.36crc 248231 116995 2.12fdct 11262 11069 1.02fibcall 1140 1131 1.01insertsort 4965 2949 1.68janne complex 4048 753 5.38jfdctint 14016 13951 1.00matmult 755274 745669 1.01ns 42550 42549 1.00nsichneu 32339 15551 2.08qsort-exam 2132100 11125 191.65statemate 108766 2809 38.72

2 3 4 5 6 7 8 9 10111213141516171819202122232425260

1

2

Number of samples

Rat

io

Fig. 3. Ratios between upper bound and black box roveri , rover

i,random and ratiobetween lower bound and black box runder

i in terms of the number of samples i.

improvement on samples 3 to 7. After the first 7 samples, theobtained lower bound is very close to the actual black-boxvalues for all benchmarks. The upper bound comes within 5%

of the black box after 16 samples for most benchmarks, whichis reflected by the geometric mean in the figure. For mostbenchmarks, the algorithm chooses to sample the black box ata different instruction scratchpad size than before, at samples12 and 16, which yields a significant precision improvement.

As a baseline for Algorithm 3, we determine how well upperbounds based on a random set of samples approximate the blackbox. In Figure 4, rover

i,random denotes the ratio between these upperbounds, based on i random samples, and the black box. Randomsampling yields much less precise estimates. In addition,computation times increase dramatically, which explains whywe only perform this experiment for up to 15 samples. Thisdemonstrates that some form of “intelligent” sampling isrequired for the black-box approach to be precise and efficient.

To evaluate analysis efficiency, we determined the analysistime of each benchmark up to and including the i

th sample.We decompose this analysis time into three components:

1) Invocations of the black box.2) Operations on affine selection trees, e.g. minimization.3) Invocations of PIPLIB.

In Figure 5 we show the geometric mean of the analysis timeover all benchmarks up to the i

th sample. The bars are stacked,meaning that the top of the upper most bar reflects the overallanalysis time. As the black box is called once for each sample,its contribution to the overall analysis time grows linearly. Asuperlinear growth is observed for the other two components,which is expected, as the problems to be solved grow witheach sample. PIPLIB’s contribution grows strongest, movingfrom the smallest to the largest share of analysis time. As

E x p e r i m e n t a l E v a l u a t i o n : Ve r s u s R a n d o m S a m p l i n g

17

Random Samples

Page 28: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

E x p e r i m e n t a l E v a l u a t i o n : A n a l y s i s T i m e i n Te r m s o f N u m b e r o f S a m p l e s

18

2 3 4 5 6 7 8 9 10111213141516171819202122232425260

1

2

Number of samplesR

atio

Fig. 4. Ratios between upper bound and black box roveri , rover

i,random and ratiobetween lower bound and black box runder

i in terms of the number of samples i.

10 20 30 400

10,000

20,000

Number of samples

Ana

lysi

sTi

me

(inm

s)

PIPLIB

AST OperationsBlack Box

Fig. 5. Analysis time in terms of number of samples.

observed in Figure 4, 16 samples usually result in very preciseparametric upper bounds. On our benchmarks, 16 samples areprocessed in less than 2.2 seconds on the average.

In Figure 4 we have shown how the actual precisiondepends on the number of samples. Next, we determinehow many samples are required to meet a certain precisionguarantee. To this end, we determine for each benchmarkand for several precision targets (✏, (⌧, ⌧))

4 the number ofsamples s

P

(✏, ⌧) that Algorithm 3 takes until termination.We summarize these values in Figure 6. Three benchmarks,nsichneu, adpcm, statemate, ran out of memory forsmaller values of ⌧ . For the tightest precision requirement,✏ = 32, we report the maximum required number of sampless

max32

(⌧) = max

P2P s

P

(32, ⌧) over all benchmarks for whichthe analysis terminated successfully, for ⌧ between 8192 and262144. For the loosest precision requirement, ✏ = 1024, onthe other hand, we report the minimum required number ofsamples s

min1024

(⌧) over all benchmarks. For ✏ between 32 and1024, s

P

(✏, ⌧) is expected to lie between s

min1024

(⌧) and s

max32

(⌧).We also report the median number of samples s

median256

(⌧) overall benchmarks for ✏ = 256. For most benchmarks, the numberof required samples is quite insensitive to ✏: the median for✏ = 256 is close to the minimum for ✏ = 1024.

VIII. APPLICATION TO COMMERCIALMICROARCHITECTURES

We have applied APTA to a parameterized version ofthe PTARM, a precision-timed architecture. It has beendesigned with the specific goal of reconciling performanceand predictability. And indeed, as we demonstrate, precise andefficient architecture-parametric WCET analysis is feasible forthe PTARM. Naturally, the question arises whether our approachis also applicable to other existing academic and commercial

4Where (⌧, ⌧) refers to the monotone parameters µI-SPM-Size and µD-SPM-Size.

262144 131072 65536 32768 16384 81920

50

100

excl. nsichneu

excl. adpcm, statemate

Precision Requirement ⌧

Num

ber

ofsa

mpl

es

Fig. 6. Number of samples required to reach precision guarantee (✏, (⌧, ⌧)).Some benchmarks, namely nsichneu, adpcm, statemate, ran out ofmemory for ✏ = 32 and are thus not included in smax

32 (⌧). The median andminimum, smedian

256 (⌧) and smin1024(⌧), could be determined for all values of ⌧ .

microarchitectures and whether a similar parameterization isreasonable for such microarchitectures.

To answer the second question first, we believe that the pa-rameterization of the PTARM is quite typical for both commer-cial and research platforms: aspects that are often configurablein single-core processors are the processor frequency, the sizesof local memories (caches or scratchpads), and the interconnect,affecting memory access latencies, corresponding directly to theparameters in the PTARM. To use multi-core architectures in ahard real-time context, most of their shared resources will haveto be partitioned (an approach promoted among others in theMERASA and Predator projects [15], [3]) in space and/or time:caches can be partitioned along their ways [16], buses and otherinterconnect can be partitioned in time by time division multipleaccess (TDMA) arbitration [17], and access to DRAM memorycan be partitioned in time [18] and space [10]. Partitioningin space intuitively induces monotone parameters, whereastime-based partitioning may sometimes be modeled with linearparameters, however, caveats exist, as discussed below.

For our approach to be applicable to a particular platform,its parameterized timing model needs to be linear or at leastmonotone in all of its parameters. This is, unfortunately, not thecase for canonical models of many complex microarchitectures.In contrast to LRU, FIFO cache replacement is known tosuffer from Belady’s anomaly, i.e., under FIFO, increasingthe cache’s size may lead to a decrease in performance. Forplatforms including such non-monotone features, monotoneparameterized timing models can be developed, however, atthe cost of a loss in precision. Alternatively, if the goal is tofind a system configuration statically, our analysis may alsobe performed on timing models that are not guaranteed to bemonotone. This yields parametric WCET estimations that arenot necessarily safe. Once a system configuration has beendetermined based on such an estimation, the black box canstill be used to verify whether the timing constraints can bemet. If not, the parametric WCET estimation can be refinedaccordingly and the process would have to be iterated.

In addition to monotonicity, our approach requires timing tobe decomposed into contributions that can be attributed to differ-ent components. Such a decomposition is natural for so-calledfully timing-compositional architectures [3], [19]. An exampleof a fully timing-compositional commercial architecture is theARM7 [3]. For more complex architectures such as the InfineonTriCore or the PowerPC 755, it is possible to create conservativecompositional models. The resulting loss in precision may, how-ever, be substantial and depends on the particular architectureand decomposition [19], and is a topic of ongoing research.

Page 29: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

E x p e r i m e n t a l E v a l u a t i o n : A n a l y s i s T i m e i n Te r m s o f N u m b e r o f S a m p l e s

18

2 3 4 5 6 7 8 9 10111213141516171819202122232425260

1

2

Number of samplesR

atio

Fig. 4. Ratios between upper bound and black box roveri , rover

i,random and ratiobetween lower bound and black box runder

i in terms of the number of samples i.

10 20 30 400

10,000

20,000

Number of samples

Ana

lysi

sTi

me

(inm

s)

PIPLIB

AST OperationsBlack Box

Fig. 5. Analysis time in terms of number of samples.

observed in Figure 4, 16 samples usually result in very preciseparametric upper bounds. On our benchmarks, 16 samples areprocessed in less than 2.2 seconds on the average.

In Figure 4 we have shown how the actual precisiondepends on the number of samples. Next, we determinehow many samples are required to meet a certain precisionguarantee. To this end, we determine for each benchmarkand for several precision targets (✏, (⌧, ⌧))

4 the number ofsamples s

P

(✏, ⌧) that Algorithm 3 takes until termination.We summarize these values in Figure 6. Three benchmarks,nsichneu, adpcm, statemate, ran out of memory forsmaller values of ⌧ . For the tightest precision requirement,✏ = 32, we report the maximum required number of sampless

max32

(⌧) = max

P2P s

P

(32, ⌧) over all benchmarks for whichthe analysis terminated successfully, for ⌧ between 8192 and262144. For the loosest precision requirement, ✏ = 1024, onthe other hand, we report the minimum required number ofsamples s

min1024

(⌧) over all benchmarks. For ✏ between 32 and1024, s

P

(✏, ⌧) is expected to lie between s

min1024

(⌧) and s

max32

(⌧).We also report the median number of samples s

median256

(⌧) overall benchmarks for ✏ = 256. For most benchmarks, the numberof required samples is quite insensitive to ✏: the median for✏ = 256 is close to the minimum for ✏ = 1024.

VIII. APPLICATION TO COMMERCIALMICROARCHITECTURES

We have applied APTA to a parameterized version ofthe PTARM, a precision-timed architecture. It has beendesigned with the specific goal of reconciling performanceand predictability. And indeed, as we demonstrate, precise andefficient architecture-parametric WCET analysis is feasible forthe PTARM. Naturally, the question arises whether our approachis also applicable to other existing academic and commercial

4Where (⌧, ⌧) refers to the monotone parameters µI-SPM-Size and µD-SPM-Size.

262144 131072 65536 32768 16384 81920

50

100

excl. nsichneu

excl. adpcm, statemate

Precision Requirement ⌧

Num

ber

ofsa

mpl

es

Fig. 6. Number of samples required to reach precision guarantee (✏, (⌧, ⌧)).Some benchmarks, namely nsichneu, adpcm, statemate, ran out ofmemory for ✏ = 32 and are thus not included in smax

32 (⌧). The median andminimum, smedian

256 (⌧) and smin1024(⌧), could be determined for all values of ⌧ .

microarchitectures and whether a similar parameterization isreasonable for such microarchitectures.

To answer the second question first, we believe that the pa-rameterization of the PTARM is quite typical for both commer-cial and research platforms: aspects that are often configurablein single-core processors are the processor frequency, the sizesof local memories (caches or scratchpads), and the interconnect,affecting memory access latencies, corresponding directly to theparameters in the PTARM. To use multi-core architectures in ahard real-time context, most of their shared resources will haveto be partitioned (an approach promoted among others in theMERASA and Predator projects [15], [3]) in space and/or time:caches can be partitioned along their ways [16], buses and otherinterconnect can be partitioned in time by time division multipleaccess (TDMA) arbitration [17], and access to DRAM memorycan be partitioned in time [18] and space [10]. Partitioningin space intuitively induces monotone parameters, whereastime-based partitioning may sometimes be modeled with linearparameters, however, caveats exist, as discussed below.

For our approach to be applicable to a particular platform,its parameterized timing model needs to be linear or at leastmonotone in all of its parameters. This is, unfortunately, not thecase for canonical models of many complex microarchitectures.In contrast to LRU, FIFO cache replacement is known tosuffer from Belady’s anomaly, i.e., under FIFO, increasingthe cache’s size may lead to a decrease in performance. Forplatforms including such non-monotone features, monotoneparameterized timing models can be developed, however, atthe cost of a loss in precision. Alternatively, if the goal is tofind a system configuration statically, our analysis may alsobe performed on timing models that are not guaranteed to bemonotone. This yields parametric WCET estimations that arenot necessarily safe. Once a system configuration has beendetermined based on such an estimation, the black box canstill be used to verify whether the timing constraints can bemet. If not, the parametric WCET estimation can be refinedaccordingly and the process would have to be iterated.

In addition to monotonicity, our approach requires timing tobe decomposed into contributions that can be attributed to differ-ent components. Such a decomposition is natural for so-calledfully timing-compositional architectures [3], [19]. An exampleof a fully timing-compositional commercial architecture is theARM7 [3]. For more complex architectures such as the InfineonTriCore or the PowerPC 755, it is possible to create conservativecompositional models. The resulting loss in precision may, how-ever, be substantial and depends on the particular architectureand decomposition [19], and is a topic of ongoing research.

16 Samples: ~ 2.2 seconds

Page 30: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

C o n c l u s i o n s a n d F u t u r e W o r k

19

First general framework for architecture-parametric timing analysis.

!

Future Work: • Parametric schedulability analysis • Integrate into a design-space exploration • Study applicability to commercial microarchitectures • “White-box” approach

Page 31: Architecture-Parametric Timing Analysisembedded.cs.uni-saarland.de/presentations/ArchitectureParametricT… · Architecture-Parametric Timing Analysis Jan Reineke ! Johannes Doerfert

C o n c l u s i o n s a n d F u t u r e W o r k

19

First general framework for architecture-parametric timing analysis.

!

Future Work: • Parametric schedulability analysis • Integrate into a design-space exploration • Study applicability to commercial microarchitectures • “White-box” approach

Thank you for your attention!