Dr. Guillem Bernat (CEO)
What is the worst-case execution time of your software?
... and how can I improve it?
(c) Rapita Systems Ltd. 2007 2
Timing of Real-Time Systems
int f (int x)
{
    return 2 * x;
}
● What does this do?
● Does it always do the same thing?
● Does it matter what it does?
● How long does it take?
● Does it always take the same time?
● Does it matter how long it takes?
● Execution time profiles
● Pathological worst-cases
Overview
• 1. Timing analysis and optimization: motivation
• 2. Worst-Case Execution Time (WCET) concepts
• 3. RapiTime: Measurement-based WCET analysis
• 4. Experience report: "Optimisation of mission computer of HAWK jet trainer"
• 5. Optimization strategies
1. Timing Analysis and Optimization
Motivational example: Brake-by-wire
[Diagram: ECU1 (control) connected to ECU2 (actuator) over a network]
The problem
• Subsystem:
  – ECU1: Control
  – ECU2: Actuator
  – Network (e.g. CAN)
• Main control task on ECU1
  – Reads sensors
  – Computes outputs
  – Writes outputs to actuators
  – Resets watchdog timer
  – Watchdog timer set to 20 ms
• The system worked fine, but after a small change in the software the steering mechanism became wobbly
• The computation task overran its assumed WCET, failing to reset the watchdog timer and resulting in a very frequent full system reset!
• Significant effort for “post-mortem” analysis
• Q: "How do you prove that the timing is correct?"
  – The main issue is proving the absence of errors
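The control loop described above can be sketched in C. This is a minimal illustration, not the actual ECU code: all names (read_sensors, compute_outputs, watchdog_reset, etc.) are made up, and the hardware interface is replaced by stubs so the sketch is self-contained.

```c
#include <stdint.h>

/* Illustrative stubs -- in a real ECU these would be device drivers and
 * hardware registers; every name here is made up for the sketch. */
uint16_t sensor_value   = 100;
uint16_t actuator_out   = 0;
int      watchdog_kicks = 0;

uint16_t read_sensors(void)          { return sensor_value; }
uint16_t compute_outputs(uint16_t s) { return (uint16_t)(s * 2u); }
void     write_actuators(uint16_t o) { actuator_out = o; }
void     watchdog_reset(void)        { watchdog_kicks++; }

/* One activation of the main control task on ECU1.  If compute_outputs()
 * ever overruns its assumed WCET, watchdog_reset() runs too late and the
 * 20 ms watchdog fires, forcing a full system reset. */
void control_task(void)
{
    write_actuators(compute_outputs(read_sensors()));
    watchdog_reset();
}
```

The point of the story: the deadline of this whole loop, not just the average behaviour of compute_outputs(), is what the watchdog enforces.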
And the problem gets bigger
And even bigger!
Trends and problems
• Increase of:
  – Functionality (and therefore SW size)
  – System complexity
  – Hardware complexity
  – Level of abstraction
• Certification of safety critical SW processes.
• Impact of these trends on timing analysis
  – Systems more difficult to test
  – Current testing practices are not adequate for timing
    • No traceability of rare (timing) faults
    • E.g. 100% MC/DC coverage is not good enough (more later!)
  – Manual approach:
    • Time consuming, expensive, tedious, error prone
Definition: Software Timing
● To provide software that meets its timing requirements:
– understanding the timing behaviour is as important as understanding the functional behaviour.
● General approach to show timing correctness:
– Understand the requirements (deadlines/throughput)
– Understand the actual behaviour (execution times, scheduling)
– Relate behaviour to requirements by analysis
● One component in any real-time methodology is:
Worst Case Execution Time:
“The maximum length of time that a section of software takes to run on the target.”
Messages...
• Timing correctness is...
  – An engineering discipline
  – Not guesswork
  – Not an afterthought
• Profiling the average case ≠ the worst case
• Optimisation opportunities (on the worst-case path)
  – Low level
  – Function level
  – Design level
2. Worst case execution time (WCET) concepts
Terminology
Terminology on execution time:
– Best-case execution time
– Average execution time
– Maximum observed (possibly unsafe)
– Real worst case (possibly not computable)
– WCET: upper bound on the real worst case (safe but pessimistic)
– Response time: including interference from other tasks and interrupts

[Diagram: execution times on a timeline, left to right: Best-case, Average, Maximum Observed, "Real" Worst-case, WCET estimate. Testing alone stops short of the real worst case; analysis is pessimistic beyond it.]
Why use WCET (and not Average)?
• Reliability:
  – If you know the worst case then (by definition) all other cases are smaller.
  – Timing problems happen when execution time budgets are exceeded (not in average situations!)
• Stability:
– Different runs of test cases will generate different “average” case times, and different maximum measured times.
– Different runs of test cases show stability of computed worst case times.
• Management of timing as code is developed:
  – Avoid "suddenly" running out of capacity
  – Identify likely timing overruns earlier in development
• Optimization:
  – Show where to optimize (and where not to)
  – Show how good your optimizations are
  – Show the effect of optimization on other parts of the code
Worst/average case optimisation

Average case optimization:

if ( test1() ) {        /* Most frequent case */
    short_function();
} else if ( test2() ) {
    test2_function();
} else if ( test3() ) {
    long_function();
}

Worst case optimization:

if ( test3() ) {        /* Worst case */
    long_function();
} else if ( test1() ) {
    short_function();
} else if ( test2() ) {
    test2_function();
}

• Increases "average" (most frequent) execution time
• Decreases time on the worst-case path
How do you know the WCET?
• Structural behaviour (high level)
  – What is the worst path through the code?
  – Maximum iterations of loops and loop interactions
  – Input data dependencies
  – History dependent / data dependent
  – Compiler transformations and optimisations
• Hardware behaviour (low level)
  – What is the execution time of an instruction?
  – What is the execution time of the longest path?
  – Depends on where things are in memory (RAM, FLASH, ROM, …)
  – Cache and MMU address translation units
  – Contention accessing external buses
• Embedded programs are "well behaved and simple" (but not always)
• Automatically generated code is difficult to analyse
WCET Analysis: approaches
– Extensive testing:
  • Measuring end-to-end execution time of the program
  • High level:
    (–) No knowledge of the structure of the program
    (–) No guarantee that the longest path has been tested
    (–) Expensive; exhaustive testing is not feasible in practice
  • Low level:
    (+) Observes the real hardware and software
    (+) No knowledge/assumptions about the hardware needed
– Static WCET analysis:
  • Analysis of the program without running it, using a timing model of the CPU
  • High level:
    (+) Finds the longest path + program structure
  • Low level:
    (–) HW is (very) difficult to model and validate (DSP, MCU, buses, etc.)
    (–) Pessimistic (simplistic?)
    (–) No information on how likely the worst-case behaviour is
– RapiTime solution: hybrid approach:
  • Measurement-based WCET analysis …
WCET Analysis: approaches (2)
– Measurement-based WCET analysis: RapiTime
  • "The best model of a system is the system itself"
– Combines the best features of both approaches
  • High level:
    (+) Uses static analysis techniques to find the longest path through a program
  • Low level:
    (+) Determines timings from actual measurements
    (+) Full distributions of execution times
    (+) No "model" approximations!
    (+) Easy and quick to port
– Integration of functional testing and timing analysis
  • Enables comparing the measured worst case with the computed worst case
– Automatic tool support
WCET analysis techniques: Overview
• High-level analysis
  – Structure of the program
  – Determining "flow facts"
    • Determining loop bounds
    • Mutually exclusive paths
  – Help from the programmer through code annotations
• Low-level analysis
  – Execution times of individual machine instructions
  – Analysis of the interaction of the features of the processor (cache, pipeline, BTB, ...)
Overview of WCET approaches

                     High Level: path analysis             Low Level: HW analysis
Black box testing    N/A                                   End-to-end measurements
Hybrid               Tree based (timing schema),           Measuring timing at basic
                     path based, or IPET based,            block level
                     + flow facts
Purely static        Tree based (timing schema),           Pipeline modelling, cache
                     path based, or IPET based,            modelling, branch target
                     + flow facts                          buffers, out-of-order
                                                           execution, contention on
                                                           buses, and their interaction…
(HW) Low-level analysis
• Simplistic model: each instruction always takes the same time to execute
• Simple basic block WCET:
      WCET(Basic Block) = Σ_{instruction ∈ BB} WCET(instruction)
• Relies on knowing the WCET of each instruction
  – Just read the CPU manual
• But:
  – Pipelined execution: WCET(Basic Block) < Σ_{instruction ∈ BB} WCET(instruction)
  – Caches, branch prediction, etc.: ET(instruction) is not constant
  – Difficult to predict *other* effects
    • Interaction of other devices on the board, memory latencies
• Reducing the pessimism
  – Taking into account these HW features
  – Requires a timing model of each HW feature + their interaction
  – Proven to be NP-hard
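The simplistic per-instruction summation above can be sketched in a few lines of C. The cycle counts and the instr_t type are made up for illustration; real per-instruction WCETs would come from the CPU manual.

```c
/* Simplistic low-level model: the WCET of a basic block is the sum of the
 * (assumed constant) WCETs of its instructions.  Cycle counts are
 * illustrative, not taken from any real manual. */
typedef struct {
    const char *mnemonic;
    unsigned    wcet_cycles;   /* per-instruction WCET "from the manual" */
} instr_t;

unsigned basic_block_wcet(const instr_t *bb, int n)
{
    unsigned total = 0;
    for (int i = 0; i < n; i++)
        total += bb[i].wcet_cycles;   /* WCET(BB) = sum over instructions */
    return total;
}
```

On a pipelined processor this sum is an over-estimate (instructions overlap), and with caches it is not even safe, which is exactly why the slide calls the model "simplistic".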
HW modeling problems
• Processor timing model
– Requires a good knowledge of the processor
• Information may be difficult to obtain
• Errors in processor documentation
• Undocumented non-functional behaviour
– Leads to a very complex mathematical model when taking into account several HW features together
– Interaction of the features. Intractable problem
– WCET analysis tools are very dependent on the processor
Example: Pipeline Analysis
• Using "reservation tables"
• Timing of each instruction in the different stages of the pipeline
• Aim: identify the reservation table of the "sequential" composition
• Extendable to other operations on the CFG (join, split)
Images from Peter Puschner, Raimund Kirner. TUVienna
Modeling Caches
• Needs a good understanding of the cache architecture
  – Types of caches
    • Instruction cache, data cache, unified
  – Cache size, and multiple levels of cache
  – Memory access times
    • Bus contention
    • Burst fetches
  – Cache layout
    • Direct mapped, n-way associative
  – Replacement strategies
    • LRU, pseudo round-robin, …
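The set-mapping behind cache conflicts can be made concrete with a small sketch. The geometry below (4-way, 32 KB, 32-byte lines) is an assumption for illustration, not any particular processor's parameters; the formula itself is the standard one for set-associative caches.

```c
#include <stdint.h>

/* Set-associative cache index calculation.  For a cache of CACHE_SIZE
 * bytes with LINE_SIZE-byte lines organised into WAYS ways, an address
 * maps to set (addr / LINE_SIZE) % NUM_SETS.  Addresses that are
 * CACHE_SIZE/WAYS bytes apart land in the same set and compete for its
 * WAYS entries. */
#define LINE_SIZE  32u
#define CACHE_SIZE (32u * 1024u)
#define WAYS       4u
#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * WAYS))   /* 256 sets */

unsigned cache_set(uint32_t addr)
{
    return (addr / LINE_SIZE) % NUM_SETS;
}
```

With these numbers, any two code blocks whose addresses differ by a multiple of 8 KB share a set; put five such blocks in one loop and you get the "cache killer pattern" discussed below.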
Example execution in cache
00000148 <send_message>:
 148: e52de004  str   lr, [sp, -#4]!
 14c: e3a01c7f  mov   r1, #32512   ; 0x7f00
 150: e3a0c000  mov   ip, #0       ; 0x0
 154: e28130fb  add   r3, r1, #251 ; 0xfb
 158: e58c3000  str   r3, [ip]
 15c: e59f3034  ldr   r3, [pc, #52] ; 198 <send_message+0x50>
 160: e59c2000  ldr   r2, [ip]
 164: e5830000  str   r0, [r3]
 168: e58c2000  str   r2, [ip]
 16c: e28110fa  add   r1, r1, #250 ; 0xfa
 170: e5d0e001  ldrb  lr, [r0, #1]
 174: e59f3020  ldr   r3, [pc, #32] ; 19c <send_message+0x54>
 178: e58c1000  str   r1, [ip]
 17c: e59c0000  ldr   r0, [ip]
 180: e5c3e000  strb  lr, [r3]
 184: e59f3014  ldr   r3, [pc, #20] ; 1a0 <send_message+0x58>
 188: e3a02001  mov   r2, #1 ; 0x1
 18c: e5c32000  strb  r2, [r3]
 190: e58c0000  str   r0, [ip]
 194: e49df004  ldr   pc, [sp], #4
 ...
void
send_message( Uint8 * msg )
{
    tx_buff_ptr = msg;
    tx_buff_len = msg[LEN];
    tx_enable   = 1;
}
Example impact of layout in memory
void foo( ... )
{
    int i;
    for (i = 0; i < N; i++) {
        A();
        B();
        C();
        if ( X() )
            D();
        E();
    }
}

• Assume a cache with associativity of 4
• Function foo calls 5 functions (A–E) in a loop
• The relative position of the code in memory can have a dramatic impact on execution time
• This is known as a "cache killer pattern"
• In the best case, A–E suffer only one cache miss each (in later iterations, they are all loaded in the cache)
• In the worst case, every iteration is a cache miss for all modules
• The condition (X()) makes this very difficult to find by "testing"
Cache address mapping on the AT697

[Figure: each code address maps to an instruction-cache set given by (address mod 8 KB); the cache holds up to 4 blocks per set. (a) "Good" layout: A–E fall into sets with load ≤ 4, so after the first iteration all accesses are hits. (b) "Bad" layout: more than 4 of A–E map to the same set (load > 4), giving 100% misses in those sets on every iteration.]
Modelling caches (2)
• Aim of cache analysis:
  – Determine, at every memory reference:
    • Always Miss (AM)
    • Always Hit (AH)
    • Globally persistent (GP): the first access is "NC", all subsequent ones are AH
    • Locally persistent (LP): the first access in a "context" is "NC", all subsequent ones are AH
    • Not classified (NC): may be a cache hit or a cache miss; for a safe approximation, assume all NC are cache misses
  – Calculation using IPET and data-flow (DFA) models
  – Secondary issues:
    • Cache conflicts
    • Optimise memory code layout
Timing Anomalies
• When a cache miss results in a *shorter* execution time than a cache hit!
• Makes the problem of static low-level WCET analysis very difficult
• Example 1: due to instruction reordering — D can start one cycle earlier

Timing Anomalies in Dynamically Scheduled Microprocessors, Thomas Lundqvist, Per Stenstrom, RTSS 1999
(Images from Peter Puschner, Raimund Kirner, TU Vienna)
Timing Anomalies (2)
• Example 2: Motorola ColdFire MCF 5307
• A cache miss may actually speed up program execution:
  – The MCF 5307 has a unified cache, and the fetch and execute pipelines are independent
  – A data access hitting in the cache is served directly from the cache
  – At the same time, the fetch pipeline fetches another instruction block from main memory, following a branch misprediction, and replaces two lines of data in the cache
  – Assume those lines would have been reused later: since they were evicted, they now cause two misses
  – If the data access had been a cache miss, the instruction fetch pipeline would not have fetched those two lines, because the execution pipeline would have resolved the misprediction before they were fetched
High Level Analysis
• Problem formulation
  – Given the execution time of small sections of code in the program
  – Find the WCET of the longest path
    • May not find the "actual" path
• To consider:
  – Loop bounds, infeasible paths, mutually exclusive paths, contexts of execution, data-dependent parameter passing
  – May need to be guided with user annotations
• Object code level
  – May require heavy user support to help the analysis (published case study: 5 months of effort to analyse the non-interruptible sections of code of an RTOS)
  – Analysis at the object code level may be very difficult
  – Captures the code generated by the compiler
  – Difficult to trace back to source
• Source code level
  – Structure is present in the program
High Level Analysis approaches
Problem:
– Determine the longest path through a program
– Input: control flow graph, with variations and annotations of timing information
• Issues:
  – Flow fact determination. Constraints on possible paths: e.g. mutually exclusive paths
  – Call contexts. Different calls may not always take the same execution time:
    • Due to data inputs (e.g. memcpy() of different sizes)
    • Or due to HW effects (e.g. the second execution of a call may be in cache)
• Calculation:
  – Path-based approaches
  – Tree-based approaches (timing schema)
  – IPET: implicit path enumeration techniques
Path based
• Enumerates all paths in the CFG
• Finds the longest one based on information about the timing of basic blocks
• Very time consuming: exponential-time algorithm
Tree based
• Build a hierarchical representation of the program from the CFG
• Timing derived for the leaves of the tree
• A "timing schema" computes the WCET of inner nodes
• Can handle very large programs
• Non-exponential-time analysis algorithm
• Program representation: a syntax tree with:
  – Leaf nodes: basic blocks
  – Inner nodes: sequence, loop, conditional, calls
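A minimal timing schema can be written as a recursion over the syntax tree: sequence nodes sum their children, conditionals add the test time to the longer branch, and loops multiply by the iteration bound. The types and node shapes below are an illustrative sketch, far simpler than a real schema (no call contexts, no cache effects).

```c
/* Timing-schema sketch: WCET computed bottom-up over the syntax tree.
 * Real schemas carry much more context; everything here is illustrative. */
typedef enum { BLOCK, SEQ, COND, LOOP } kind_t;

typedef struct node {
    kind_t       kind;
    unsigned     time;    /* BLOCK: block WCET; COND/LOOP: test WCET */
    unsigned     bound;   /* LOOP: maximum number of iterations */
    struct node *a, *b;   /* children (then/else for COND) */
} node_t;

unsigned wcet(const node_t *n)
{
    switch (n->kind) {
    case BLOCK: return n->time;
    case SEQ:   return wcet(n->a) + wcet(n->b);
    case COND: {                       /* take the longer branch */
        unsigned t = wcet(n->a), e = wcet(n->b);
        return n->time + (t > e ? t : e);
    }
    case LOOP:  return n->bound * (n->time + wcet(n->a));
    }
    return 0;
}
```

Applied to the "minimal example" on the later slide (blocks of 10 µs around two conditionals with branches 20/50 and 60/5), this recursion yields the computed worst case of 140, even if that path was never exercised by a test.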
IPET based
• Formulate the WCET problem as a linear maximisation problem
• Solve it using standard linear programming solvers (lp_solve)
• Allows interesting restrictions to be described
• Execution time can still be prohibitive
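The IPET formulation mentioned above has a standard textbook shape; the symbols here are introduced for this sketch (not taken from the slides): c_i is the WCET of basic block i, x_i its execution count, and f_e the traversal count of CFG edge e.

```latex
\max \sum_i c_i\, x_i
\quad \text{subject to} \quad
x_{\text{entry}} = 1,
\qquad
x_i \;=\; \sum_{e \in \text{in}(i)} f_e \;=\; \sum_{e \in \text{out}(i)} f_e \;\; \forall i,
\qquad
x_{\text{loop body}} \;\le\; n \cdot x_{\text{loop header}}
```

The flow-conservation constraints encode the CFG structure, loop-bound constraints (with bound n) and any extra flow facts (e.g. mutual exclusion) are added as further linear inequalities, and the maximum of the objective is the WCET estimate.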
RapiTime "minimal example"

... /* 10 us */
if( a ){
    f1();  /* 20 us */
}else{
    f2();  /* 50 us */
}
... /* 10 us */
if( b ){
    f3();  /* 60 us */
}else{
    f4();  /* 5 us */
}
... /* 10 us */
RapiTime “minimal example”
• Tested: the "green" and "blue" paths
• Maximum measured time is 110
• RapiTime builds the structure of the program
  – Abstract syntax tree
  – Not all possible paths were tested
  – But the worst path is computed "statically"
• And determines that the “red” path is possible, and computes WCET of 140.
• NOTE: 100% MC/DC coverage was achieved
  – But that is not enough for timing!
[Diagram: syntax tree of the example, with block times 10, 20 (f1) / 50 (f2), 10, 60 (f3) / 5 (f4), 10. Tested paths: 10+20+10+60+10 = 110 and 10+50+10+5+10 = 85; computed worst-case path: 10+50+10+60+10 = 140.]
RapiTime “minimal example”
RPT_Ipoint(9);
... /* 10 us */
if( a ){
    RPT_Ipoint(10);
    f1();  /* 20 us */
}else{
    RPT_Ipoint(11);
    f2();  /* 50 us */
}
RPT_Ipoint(12);
... /* 10 us */
if( b ){
    RPT_Ipoint(13);
    f3();  /* 60 us */
}else{
    RPT_Ipoint(14);
    f4();  /* 5 us */
}
RPT_Ipoint(15);
... /* 10 us */
RPT_Ipoint(16);
Example trace

Ipoint   Timestamp
9        0 us
10       10 us   (+10)
12       30 us   (+20)
13       40 us   (+10)
15       100 us  (+60)
16       110 us  (+10)
RapiTime: how it works
● RapiTime: Software tool that computes the longest time a piece of software (process) can take to run – worst-case execution time: WCET
– on-target profiler for the worst case
● Procedure:
  1. Run tests on the target system
     – Use existing test cases if possible
  2. Produce a trace of execution
     – Captured by logic analyser, ETM, etc.
  3. RapiTime builds a report
RapiTime Analysis Process
[Diagram: standard build process — Sources → compile/link → Executable → run (Tests). RapiTime adds: instrument the sources, extract the program structure, and collect a trace of each run (Traces) → WCET Analysis → WCET reports.]

Supported compilers:
– Ada: GNAT, GreenHills Ada, Aonix
– C: GCC, Tasking, dcc, Cosmic, GreenHills, IAR
Tracing Mechanism
● Four general architectures to produce traces:1. Software only (low cost, high overhead)
– Instrumentation points write to memory buffer
– Dump buffer via debugger or serial etc.
2. Software with hardware support
– Instrumentation points write to IO port (etc)
– Logic Analyser or TraceBox timestamps data
3. Full hardware (transparent)
– In-circuit emulator, ETM or Nexus trace etc
– Debugger provides timestamped trace (branch points)
4. Simulated (high cost, low overhead, transparent)
– Software simulator provides a full program counter trace
● Some options best with instrumentation, some without
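Option 1 (software-only tracing to a memory buffer) can be sketched as below. This is an illustrative implementation, not RapiTime's actual one: the buffer layout and read_cycle_counter() (here a stub; on real hardware it would read a free-running timer) are assumptions.

```c
#include <stdint.h>

#define TRACE_CAPACITY 4096

/* Stub timestamp source for the sketch -- on target this would read a
 * free-running hardware timer or cycle counter. */
static uint32_t fake_cycles;
static uint32_t read_cycle_counter(void) { return fake_cycles += 10; }

typedef struct {
    uint16_t ipoint;     /* instrumentation point id */
    uint32_t timestamp;  /* cycle count at the ipoint */
} trace_entry_t;

trace_entry_t trace_buf[TRACE_CAPACITY];
unsigned      trace_len;

/* Each instrumentation point appends an (id, timestamp) pair to the RAM
 * buffer, which is later dumped via the debugger or a serial port. */
void RPT_Ipoint(uint16_t id)
{
    if (trace_len < TRACE_CAPACITY) {
        trace_buf[trace_len].ipoint    = id;
        trace_buf[trace_len].timestamp = read_cycle_counter();
        trace_len++;
    }
}
```

The trade-off named on the slide is visible here: no external hardware is needed, but every ipoint costs a timer read and a store (the "high overhead" of the software-only option).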
Instrumentation example
● Instrumentation is automatic
● Configurable:
  – Lots of instrumentation: good report
  – Little instrumentation: less detailed report, but lower overhead
if (a) {
    c = b();
}
return f(c+d);

if (a) {
    RPT_Ipoint(15);
    c = b();
}
{
    int tmp = f(c+d);
    RPT_Ipoint(16);
    return tmp;
}
Example tracing
• Write a number to a memory location
  – Down to 1 or 2 assembler instructions
• External hardware does the tracing, for example:
  – Debugger:
    • Lauterbach Trace32, iSYSTEM, American Arium, ...
  – Logic analyser:
    • Tektronix, Agilent
  – Specialised TraceBox
    • Continuous tracing

/* Write to address */
#define RPT_Ipoint(I) \
    (void)((*(volatile unsigned short *)PORT_ADDR) = (I))
RapiTime Report
Example: profile
Profile of a small section of code
Experience Report: BAE Systems Hawk Jet Trainer
• Analysis of the Mission Computer
• Rack of several MPC7410s, at 500 MHz
• Written in Ada, 25 SW partitions
• Several 100K LOC
• Key objective:
  – Optimise overall execution by 10%
  – 4 partitions analysed (25% of the schedule)
• WCET hotspots
  – RapiTime identified key WCET hotspots and provided quantification of optimisation opportunities
• Main optimisations done:
  – Removing code that created redundant copies of large data structures
  – Rewriting bit-unpacking code
  – Enabling the compiler to generate much more efficient code (called 700 times)
  – Rewriting case statements using look-up tables (called 450 times)
• Results:
  – 23% reduction in WCET
  – By only examining 1.2% of the code
5. Optimization
Optimization
• Optimization is a compromise between:
  – Time
  – Space
  – Maintainability
  – Effort
• At what level can you optimize?
  – Instruction level (compiler)
  – Function level (algorithms)
  – Design level (architecture)
Why you should not “optimize”!
• Maintainability:
  – Optimised code can be hard to read and maintain
  – Was the loss in maintainability worth it?
• No improvement:
  – Are you sure that your code is faster?
  – On the worst-case path?
• Compiler optimizations:
  – Compilers often generate faster code if you don't try to be "clever" when writing it
  – But they can generate very nasty code from "innocent" source
• hard to predict by inspection of code
Why you have to optimize
• Because you have to improve the code performance....
How to Optimize
• Know where to optimize
  – High-level first: why do you call that function so many times?
  – Low-level: reduce the execution time of that function
• Know how much time you could save
  – Prioritisation
• Know how much time you actually save
  – The worst-case path may switch!
  – Evidence
  – Roll back optimizations if the benefit is not worth the loss in maintainable code
• Understand the worst-case behaviour
• Understand compiler and target-specific optimizations
• Target: Optimize for maximum improvement with minimum changes.
Compiler-related Hotspots
• Today's compilers are powerful tools– path from high-level language to machine code is complex
• Compilers can do good optimizations
  – Compilers know about high-level language features; don't try to use low-level hacks for optimization
    • e.g. a[i] is often faster than *(a+i)
– Using appropriate compiler switches can make significant difference to execution time, without modifying code.
• Compilers can add heavyweight "hidden" code that isn't in the source
  – Copying data
  – Setting up argument lists
• Analysis of the program, running on the target, is the best way to find hidden execution time.
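One common source of hidden copying is passing a large struct by value: the compiler silently emits a copy of the whole struct at every call site. The sizes and names below are illustrative.

```c
/* Hidden code the compiler adds: passing a large struct by value makes
 * the compiler copy the whole struct at every call; passing a pointer
 * copies only the pointer.  frame_t and its size are made up. */
typedef struct { int samples[256]; } frame_t;   /* ~1 KB */

int sum_by_value(frame_t f)            /* hidden ~1 KB copy per call */
{
    int s = 0;
    for (int i = 0; i < 256; i++)
        s += f.samples[i];
    return s;
}

int sum_by_pointer(const frame_t *f)   /* copies only a pointer */
{
    int s = 0;
    for (int i = 0; i < 256; i++)
        s += f->samples[i];
    return s;
}
```

Both functions compute the same result; only on-target analysis (or inspecting the generated code) reveals the extra time the by-value copy costs on each call.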
Examples of low-level optimization
• Cache locking
  – Lock code on the WCET path, not necessarily the most-used path
• Data widths and alignments
  – Using non-native sizes adds code size/time
• Signed/unsigned operations
• Multiply and divide can be slow on some architectures
• Inline functions
• Compiler pragmas/options
• Know the processor:
  – Optimizations for one processor may result in long execution times on another
Function Level Optimization
• Rewriting functions using techniques such as:
  – Loop unrolling
  – Avoiding extra copies
  – Removing unnecessary checking
  – Lookup tables
  – Data caching
  – Iteration/traversal of data structures
  – Loop jamming (joining loops)
  – Many optimizations are based on moving to single-path code
  – Optimal algorithm complexity: use the right algorithm for the task!
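The lookup-table technique from the list above (also used in the Hawk case study to rewrite case statements) can be sketched as follows. The modes and gain values are invented for the example; the point is that the table version replaces a chain of data-dependent compares with one bounded memory access.

```c
/* Lookup table instead of a case statement: a switch may compile to a
 * compare chain with data-dependent timing; a table lookup is one
 * bounds check plus one memory access.  Modes/values are illustrative. */
enum { MODE_IDLE, MODE_CRUISE, MODE_CLIMB, MODE_DESCENT, MODE_COUNT };

int gain_switch(int mode)
{
    switch (mode) {
    case MODE_IDLE:    return 1;
    case MODE_CRUISE:  return 4;
    case MODE_CLIMB:   return 8;
    case MODE_DESCENT: return 6;
    default:           return 0;
    }
}

static const int gain_table[MODE_COUNT] = { 1, 4, 8, 6 };

int gain_lookup(int mode)   /* same mapping, near-constant time */
{
    return (mode >= 0 && mode < MODE_COUNT) ? gain_table[mode] : 0;
}
```

Whether this actually wins depends on the compiler and target (some compilers already generate jump tables for dense switches), so the improvement should be confirmed by measurement.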
Single Path Code
• We are interested in:
  – Reducing the worst-case execution time
  – Reducing the variability of execution time
• Example: loop unrolling
  – Unroll the loop to its maximum number of iterations
  – "Average" execution time may increase because we never exit the loop early
  – The worst case decreases because there are no tests or branches
• However, you must consider:
  – Is the change really an optimization on the specific target?
  – Is the optimization worth the extra code overhead and loss of readability?
• Use tool support to quantify the impact on timing.
Loop unroll to single path
The loop can execute a maximum of 4 times:

for (i=0; i<length; i++){
    dest[i]=src[i];
}

Unroll to 4 copies anyway; the WCET can be shorter because there are fewer tests and branches:

dest[0]=src[0];
dest[1]=src[1];
dest[2]=src[2];
dest[3]=src[3];
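The fully unrolled version above assumes length is always 4. When shorter lengths are possible, a guarded unroll (a variant sketched here, not from the slides) keeps correctness while still removing the loop-control overhead; note it is no longer strictly single-path, since the guards remain.

```c
/* Guarded unroll for a copy of at most 4 elements.  Correct for any
 * 0 <= length <= 4; strictly single-path only when length is always 4. */
void copy4(int *dest, const int *src, int length)
{
    if (length > 0) dest[0] = src[0];
    if (length > 1) dest[1] = src[1];
    if (length > 2) dest[2] = src[2];
    if (length > 3) dest[3] = src[3];
}
```

As the slide warns, whether either form is actually faster than the loop depends on the target and compiler, so the change should be confirmed by measurement on the worst-case path.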
Design Optimizations
• High-level optimizations can give the most benefit (but perhaps at the highest cost)
• Communication model
  – Synchronization, scheduling
  – Data flows
• Software reuse:
  – Generic code may have more functionality than you need
  – Unnecessary levels of abstraction
Tool support
• Timing and optimization are not guesswork (any more)
  – Avoid the manual work of instrumenting/measuring
  – Powerful analysis and reporting
  – In a large project you need a floodlight, not a candle
• Tool support tells you:
  – Where the worst-case hotspots are
  – How much you could save
  – How much you actually save (or lose!)
Conclusions
• Determining WCET is essential for proving timing properties
• WCET is an "upper bound"
  – Tasks have variability in execution times
• Various approaches for calculating WCET, with pros and cons:
– End-to-end testing– Purely static– Hybrid approach
• In any case, detailed knowledge of the application AND the target is essential
Measurement based WCET Analysis
• Measurement-based WCET analysis:
  – Avoids low-level modeling
  – Measures small components of the system
  – Puts these pieces together using high-level static analysis techniques
  – Combines the best features of both worlds
• Testing
  – But relies on very good test data (garbage in, garbage out)
  – In any case it is "better" than end-to-end testing
• Process
  – Code instrumentation
  – Running the program under a "good" test environment
  – Collection of trace data
  – Trace data as input to the high-level analysis