Dr. Guillem Bernat (CEO)
What is the worst-case execution time of your software?
... and how can I improve it?
(c) Rapita Systems Ltd. 2007 2
Timing of Real-Time Systems
int f (int x)
{
    return 2 * x;
}
● What does this do?
● Does it always do the same thing?
● Does it matter what it does?
● How long does it take?
● Does it always take the same time?
● Does it matter how long it takes?
● Execution time profiles
● Pathological worst-cases
Overview
• 1. Timing analysis and optimization: motivation
• 2. Worst-Case Execution Time (WCET) concepts
• 3. RapiTime: Measurement-based WCET analysis
• 4. Experience report: "Optimisation of mission computer of HAWK jet trainer"
• 5. Optimization strategies
1. Timing Analysis and Optimization
Motivational example: Brake-by-wire
[Diagram: ECU1 (control) connected to ECU2 (actuator) over a network]
The problem
• Subsystem:
  – ECU1: Control
  – ECU2: Actuator
  – Network (e.g. CAN)
• Main control task on ECU1
  – Reads sensors
  – Computes outputs
  – Writes outputs to actuators
  – Resets watchdog timer
  – Watchdog timer set to 20 ms
• The system worked fine, but after a small change in the software the steering mechanism became wobbly
• The computation task overran its assumed WCET, failing to reset the watchdog timer and resulting in a very frequent full system reset!
• Significant effort for “post-mortem” analysis
• Q: "How do you prove that the timing is correct?"
  – The main issue is proving the absence of errors
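The control loop described above can be sketched in C. This is a minimal illustration, not the actual ECU code: all names (read_sensors, compute_outputs, watchdog_reset, etc.) are made up, and the hardware interface is replaced by stubs so the sketch is self-contained.

```c
#include <stdint.h>

/* Illustrative stubs -- in a real ECU these would be device drivers and
 * hardware registers; every name here is made up for the sketch. */
uint16_t sensor_value   = 100;
uint16_t actuator_out   = 0;
int      watchdog_kicks = 0;

uint16_t read_sensors(void)          { return sensor_value; }
uint16_t compute_outputs(uint16_t s) { return (uint16_t)(s * 2u); }
void     write_actuators(uint16_t o) { actuator_out = o; }
void     watchdog_reset(void)        { watchdog_kicks++; }

/* One activation of the main control task on ECU1.  If compute_outputs()
 * ever overruns its assumed WCET, watchdog_reset() runs too late and the
 * 20 ms watchdog fires, forcing a full system reset. */
void control_task(void)
{
    write_actuators(compute_outputs(read_sensors()));
    watchdog_reset();
}
```

The point of the story: the deadline of this whole loop, not just the average behaviour of compute_outputs(), is what the watchdog enforces.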
And the problem gets bigger
And even bigger!
Trends and problems
• Increase of:
  – Functionality (and therefore SW size)
  – System complexity
  – Hardware complexity
  – Level of abstraction
• Certification of safety critical SW processes.
• Impact of these trends on timing analysis
  – Systems more difficult to test
  – Current testing practices are not adequate for timing
    • No traceability of rare (timing) faults
    • E.g. 100% MC/DC coverage is not good enough (more later!)
  – Manual approach:
    • Time consuming, expensive, tedious, error prone
Definition: Software Timing
● To provide software that meets its timing requirements:
– understanding the timing behaviour is as important as understanding the functional behaviour.
● General approach to show timing correctness:
– Understand the requirements (deadlines/throughput)
– Understand the actual behaviour (execution times, scheduling)
– Relate behaviour to requirements by analysis
● One component in any real-time methodology is:
Worst Case Execution Time:
“The maximum length of time that a section of software takes to run on the target.”
Messages...
• Timing correctness is...
  – An engineering discipline
  – Not guesswork
  – Not an afterthought
• Profiling the average case ≠ the worst case
• Optimisation opportunities (on the worst-case path)
  – Low level
  – Function level
  – Design level
2. Worst case execution time (WCET) concepts
Terminology
Terminology on execution time:
– Best-case execution time
– Average execution time
– Maximum observed (possibly unsafe)
– Real worst case (possibly not computable)
– WCET: upper bound on the real worst case (safe but pessimistic)
– Response time: including interference from other tasks and interrupts

[Diagram: execution times on a timeline, left to right: Best-case, Average, Maximum Observed, "Real" Worst-case, WCET estimate. Testing alone stops short of the real worst case; analysis is pessimistic beyond it.]
Why use WCET (and not Average)?
• Reliability:
  – If you know the worst case then (by definition) all other cases are smaller.
  – Timing problems happen when execution time budgets are exceeded (not in average situations!)
• Stability:
– Different runs of test cases will generate different “average” case times, and different maximum measured times.
– Different runs of test cases show stability of computed worst case times.
• Management of timing as code is developed:
  – Avoid "suddenly" running out of capacity
  – Identify likely timing overruns earlier in development
• Optimization:
  – Show where to optimize (and where not to)
  – Show how good your optimizations are
  – Show the effect of optimization on other parts of the code
Worst/average case optimisation

Average case optimization:

if ( test1() ) {        /* Most frequent case */
    short_function();
} else if ( test2() ) {
    test2_function();
} else if ( test3() ) {
    long_function();
}

Worst case optimization:

if ( test3() ) {        /* Worst case */
    long_function();
} else if ( test1() ) {
    short_function();
} else if ( test2() ) {
    test2_function();
}

• Increases "average" (most frequent) execution time
• Decreases time on the worst-case path
How do you know the WCET?
• Structural behaviour (high level)
  – What is the worst path through the code?
  – Maximum iterations of loops and loop interactions
  – Input data dependencies
  – History dependent / data dependent
  – Compiler transformations and optimisations
• Hardware behaviour (low level)
  – What is the execution time of an instruction?
  – What is the execution time of the longest path?
  – Depends on where things are in memory (RAM, FLASH, ROM, …)
  – Cache and MMU address translation units
  – Contention accessing external buses
• Embedded programs are "well behaved and simple" (but not always)
• Automatically generated code is difficult to analyse
WCET Analysis: approaches
– Extensive testing:
  • Measuring end-to-end execution time of the program
  • High level:
    (–) No knowledge of the structure of the program
    (–) No guarantee that the longest path has been tested
    (–) Expensive; exhaustive testing is not feasible in practice
  • Low level:
    (+) Observes the real hardware and software
    (+) No knowledge/assumptions about the hardware needed
– Static WCET analysis:
  • Analysis of the program without running it, using a timing model of the CPU
  • High level:
    (+) Finds the longest path + program structure
  • Low level:
    (–) HW is (very) difficult to model and validate (DSP, MCU, buses, etc.)
    (–) Pessimistic (simplistic?)
    (–) No information on how likely the worst-case behaviour is
– RapiTime solution: hybrid approach:
  • Measurement-based WCET analysis …
WCET Analysis: approaches (2)
– Measurement-based WCET analysis: RapiTime
  • "The best model of a system is the system itself"
– Combines the best features of both approaches
  • High level:
    (+) Uses static analysis techniques to find the longest path through a program
  • Low level:
    (+) Determines timings from actual measurements
    (+) Full distributions of execution times
    (+) No "model" approximations!
    (+) Easy and quick to port
– Integration of functional testing and timing analysis
  • Enables comparing the measured worst case with the computed worst case
– Automatic tool support
WCET analysis techniques: Overview
• High-level analysis
  – Structure of the program
  – Determining "flow facts"
    • Determining loop bounds
    • Mutually exclusive paths
  – Help from the programmer through code annotations
• Low-level analysis
  – Execution times of individual machine instructions
  – Analysis of the interaction of the features of the processor (cache, pipeline, BTB, ...)
Overview of WCET approaches

                     High Level: path analysis             Low Level: HW analysis
Black box testing    N/A                                   End-to-end measurements
Hybrid               Tree based (timing schema),           Measuring timing at basic
                     path based, or IPET based,            block level
                     + flow facts
Purely static        Tree based (timing schema),           Pipeline modelling, cache
                     path based, or IPET based,            modelling, branch target
                     + flow facts                          buffers, out-of-order
                                                           execution, contention on
                                                           buses, and their interaction…
(HW) Low-level analysis
• Simplistic model: each instruction always takes the same time to execute
• Simple basic block WCET:
      WCET(Basic Block) = Σ_{instruction ∈ BB} WCET(instruction)
• Relies on knowing the WCET of each instruction
  – Just read the CPU manual
• But:
  – Pipelined execution: WCET(Basic Block) < Σ_{instruction ∈ BB} WCET(instruction)
  – Caches, branch prediction, etc.: ET(instruction) is not constant
  – Difficult to predict *other* effects
    • Interaction of other devices on the board, memory latencies
• Reducing the pessimism
  – Taking into account these HW features
  – Requires a timing model of each HW feature + their interaction
  – Proven to be NP-hard
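The simplistic per-instruction summation above can be sketched in a few lines of C. The cycle counts and the instr_t type are made up for illustration; real per-instruction WCETs would come from the CPU manual.

```c
/* Simplistic low-level model: the WCET of a basic block is the sum of the
 * (assumed constant) WCETs of its instructions.  Cycle counts are
 * illustrative, not taken from any real manual. */
typedef struct {
    const char *mnemonic;
    unsigned    wcet_cycles;   /* per-instruction WCET "from the manual" */
} instr_t;

unsigned basic_block_wcet(const instr_t *bb, int n)
{
    unsigned total = 0;
    for (int i = 0; i < n; i++)
        total += bb[i].wcet_cycles;   /* WCET(BB) = sum over instructions */
    return total;
}
```

On a pipelined processor this sum is an over-estimate (instructions overlap), and with caches it is not even safe, which is exactly why the slide calls the model "simplistic".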
HW modeling problems
• Processor timing model
– Requires a good knowledge of the processor
• Information may be difficult to obtain
• Errors in processor documentation
• Undocumented non-functional behaviour
– Leads to a very complex mathematical model when taking into account several HW features together
– Interaction of the features. Intractable problem
– WCET analysis tools are very dependent on the processor
Example: Pipeline Analysis
• Using "reservation tables"
• Timing of each instruction in the different stages of the pipeline
• Aim: identify the reservation table of the "sequential" composition
• Extendable to other operations on the CFG (join, split)
Images from Peter Puschner, Raimund Kirner. TUVienna
Modeling Caches
• Needs a good understanding of the cache architecture
  – Types of caches
    • Instruction cache, data cache, unified
  – Cache size, and multiple levels of cache
  – Memory access times
    • Bus contention
    • Burst fetches
  – Cache layout
    • Direct mapped, n-way associative
  – Replacement strategies
    • LRU, pseudo round-robin, …
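The set-mapping behind cache conflicts can be made concrete with a small sketch. The geometry below (4-way, 32 KB, 32-byte lines) is an assumption for illustration, not any particular processor's parameters; the formula itself is the standard one for set-associative caches.

```c
#include <stdint.h>

/* Set-associative cache index calculation.  For a cache of CACHE_SIZE
 * bytes with LINE_SIZE-byte lines organised into WAYS ways, an address
 * maps to set (addr / LINE_SIZE) % NUM_SETS.  Addresses that are
 * CACHE_SIZE/WAYS bytes apart land in the same set and compete for its
 * WAYS entries. */
#define LINE_SIZE  32u
#define CACHE_SIZE (32u * 1024u)
#define WAYS       4u
#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * WAYS))   /* 256 sets */

unsigned cache_set(uint32_t addr)
{
    return (addr / LINE_SIZE) % NUM_SETS;
}
```

With these numbers, any two code blocks whose addresses differ by a multiple of 8 KB share a set; put five such blocks in one loop and you get the "cache killer pattern" discussed below.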
Example execution in cache
00000148 <send_message>:
 148: e52de004  str   lr, [sp, -#4]!
 14c: e3a01c7f  mov   r1, #32512   ; 0x7f00
 150: e3a0c000  mov   ip, #0       ; 0x0
 154: e28130fb  add   r3, r1, #251 ; 0xfb
 158: e58c3000  str   r3, [ip]
 15c: e59f3034  ldr   r3, [pc, #52] ; 198 <send_message+0x50>
 160: e59c2000  ldr   r2, [ip]
 164: e5830000  str   r0, [r3]
 168: e58c2000  str   r2, [ip]
 16c: e28110fa  add   r1, r1, #250 ; 0xfa
 170: e5d0e001  ldrb  lr, [r0, #1]
 174: e59f3020  ldr   r3, [pc, #32] ; 19c <send_message+0x54>
 178: e58c1000  str   r1, [ip]
 17c: e59c0000  ldr   r0, [ip]
 180: e5c3e000  strb  lr, [r3]
 184: e59f3014  ldr   r3, [pc, #20] ; 1a0 <send_message+0x58>
 188: e3a02001  mov   r2, #1 ; 0x1
 18c: e5c32000  strb  r2, [r3]
 190: e58c0000  str   r0, [ip]
 194: e49df004  ldr   pc, [sp], #4
 ...
void
send_message( Uint8 * msg )
{
    tx_buff_ptr = msg;
    tx_buff_len = msg[LEN];
    tx_enable   = 1;
}
Example impact of layout in memory
void foo( ... )
{
    int i;
    for (i = 0; i < N; i++) {
        A();
        B();
        C();
        if ( X() )
            D();
        E();
    }
}

• Assume a cache with associativity of 4
• Function foo calls 5 functions (A–E) in a loop
• The relative position of the code in memory can have a dramatic impact on execution time
• This is known as a "cache killer pattern"
• In the best case, A–E suffer only one cache miss each (in later iterations, they are all loaded in the cache)
• In the worst case, every iteration is a cache miss for all modules
• The condition (X()) makes this very difficult to find by "testing"
Cache address mapping on the AT697

[Figure: each code address maps to an instruction-cache set given by (address mod 8 KB); the cache holds up to 4 blocks per set. (a) "Good" layout: A–E fall into sets with load ≤ 4, so after the first iteration all accesses are hits. (b) "Bad" layout: more than 4 of A–E map to the same set (load > 4), giving 100% misses in those sets on every iteration.]
Modelling caches (2)
• Aim of cache analysis:
  – Determine, at every memory reference:
    • Always Miss (AM)
    • Always Hit (AH)
    • Globally persistent (GP): the first access is "NC", all subsequent ones are AH
    • Locally persistent (LP): the first access in a "context" is "NC", all subsequent ones are AH
    • Not classified (NC): may be a cache hit or a cache miss; for a safe approximation, assume all NC are cache misses
  – Calculation using IPET and data-flow (DFA) models
  – Secondary issues:
    • Cache conflicts
    • Optimise memory code layout
Timing Anomalies
• When a cache miss results in a *shorter* execution time than a cache hit!
• Makes the problem of static low-level WCET analysis very difficult
• Example 1: due to instruction reordering — D can start one cycle earlier

Timing Anomalies in Dynamically Scheduled Microprocessors, Thomas Lundqvist, Per Stenstrom, RTSS 1999
(Images from Peter Puschner, Raimund Kirner, TU Vienna)
Timing Anomalies (2)
• Example 2: Motorola ColdFire MCF 5307
• A cache miss may actually speed up program execution:
  – The MCF 5307 has a unified cache, and the fetch and execute pipelines are independent
  – A data access hitting in the cache is served directly from the cache
  – At the same time, the fetch pipeline fetches another instruction block from main memory, following a branch misprediction, and replaces two lines of data in the cache
  – Assume those lines would have been reused later: since they were evicted, they now cause two misses
  – If the data access had been a cache miss, the instruction fetch pipeline would not have fetched those two lines, because the execution pipeline would have resolved the misprediction before they were fetched
High Level Analysis
• Problem formulation
  – Given the execution time of small sections of code in the program
  – Find the WCET of the longest path
    • May not find the "actual" path
• To consider:
  – Loop bounds, infeasible paths, mutually exclusive paths, contexts of execution, data-dependent parameter passing
  – May need to be guided with user annotations
• Object code level
  – May require heavy user support to help the analysis (published case study: 5 months of effort to analyse the non-interruptible sections of code of an RTOS)
  – Analysis at the object code level may be very difficult
  – Captures the code generated by the compiler
  – Difficult to trace back to source
• Source code level
  – Structure is present in the program
High Level Analysis approaches
Problem:
– Determine the longest path through a program
– Input: control flow graph, with variations and annotations of timing information
• Issues:
  – Flow fact determination. Constraints on possible paths: e.g. mutually exclusive paths
  – Call contexts. Different calls may not always take the same execution time:
    • Due to data inputs (e.g. memcpy() of different sizes)
    • Or due to HW effects (e.g. the second execution of a call may be in cache)
• Calculation:
  – Path-based approaches
  – Tree-based approaches (timing schema)
  – IPET: implicit path enumeration techniques
Path based
• Enumerates all paths in the CFG
• Finds the longest one based on information about the timing of basic blocks
• Very time consuming: exponential-time algorithm
Tree based
• Build a hierarchical representation of the program from the CFG
• Timing derived for the leaves of the tree
• A "timing schema" computes the WCET of inner nodes
• Can handle very large programs
• Non-exponential-time analysis algorithm
• Program representation: a syntax tree with:
  – Leaf nodes: basic blocks
  – Inner nodes: sequence, loop, conditional, calls
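A minimal timing schema can be written as a recursion over the syntax tree: sequence nodes sum their children, conditionals add the test time to the longer branch, and loops multiply by the iteration bound. The types and node shapes below are an illustrative sketch, far simpler than a real schema (no call contexts, no cache effects).

```c
/* Timing-schema sketch: WCET computed bottom-up over the syntax tree.
 * Real schemas carry much more context; everything here is illustrative. */
typedef enum { BLOCK, SEQ, COND, LOOP } kind_t;

typedef struct node {
    kind_t       kind;
    unsigned     time;    /* BLOCK: block WCET; COND/LOOP: test WCET */
    unsigned     bound;   /* LOOP: maximum number of iterations */
    struct node *a, *b;   /* children (then/else for COND) */
} node_t;

unsigned wcet(const node_t *n)
{
    switch (n->kind) {
    case BLOCK: return n->time;
    case SEQ:   return wcet(n->a) + wcet(n->b);
    case COND: {                       /* take the longer branch */
        unsigned t = wcet(n->a), e = wcet(n->b);
        return n->time + (t > e ? t : e);
    }
    case LOOP:  return n->bound * (n->time + wcet(n->a));
    }
    return 0;
}
```

Applied to the "minimal example" on the later slide (blocks of 10 µs around two conditionals with branches 20/50 and 60/5), this recursion yields the computed worst case of 140, even if that path was never exercised by a test.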
IPET based
• Formulate the WCET problem as a linear maximisation problem
• Solve it using standard linear programming solvers (lp_solve)
• Allows interesting restrictions to be described
• Execution time can still be prohibitive
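The IPET formulation mentioned above has a standard textbook shape; the symbols here are introduced for this sketch (not taken from the slides): c_i is the WCET of basic block i, x_i its execution count, and f_e the traversal count of CFG edge e.

```latex
\max \sum_i c_i\, x_i
\quad \text{subject to} \quad
x_{\text{entry}} = 1,
\qquad
x_i \;=\; \sum_{e \in \text{in}(i)} f_e \;=\; \sum_{e \in \text{out}(i)} f_e \;\; \forall i,
\qquad
x_{\text{loop body}} \;\le\; n \cdot x_{\text{loop header}}
```

The flow-conservation constraints encode the CFG structure, loop-bound constraints (with bound n) and any extra flow facts (e.g. mutual exclusion) are added as further linear inequalities, and the maximum of the objective is the WCET estimate.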
RapiTime "minimal example"

... /* 10 us */
if( a ){
    f1();  /* 20 us */
}else{
    f2();  /* 50 us */
}
... /* 10 us */
if( b ){
    f3();  /* 60 us */
}else{
    f4();  /* 5 us */
}
... /* 10 us */
RapiTime “minimal example”
• Tested: the "green" and "blue" paths
• Maximum measured time is 110
• RapiTime builds the structure of the program
  – Abstract syntax tree
  – Not all possible paths were tested
  – But the worst path is computed "statically"
• And determines that the “red” path is possible, and computes WCET of 140.
• NOTE: 100% MC/DC coverage was achieved
  – But that is not enough for timing!
[Diagram: syntax tree of the example, with block times 10, 20 (f1) / 50 (f2), 10, 60 (f3) / 5 (f4), 10. Tested paths: 10+20+10+60+10 = 110 and 10+50+10+5+10 = 85; computed worst-case path: 10+50+10+60+10 = 140.]
RapiTime “minimal example”
RPT_Ipoint(9);
... /* 10 us */
if( a ){
    RPT_Ipoint(10);
    f1();  /* 20 us */
}else{
    RPT_Ipoint(11);
    f2();  /* 50 us */
}
RPT_Ipoint(12);
... /* 10 us */
if( b ){
    RPT_Ipoint(13);
    f3();  /* 60 us */
}else{
    RPT_Ipoint(14);
    f4();  /* 5 us */
}
RPT_Ipoint(15);
... /* 10 us */
RPT_Ipoint(16);
Example trace

Ipoint   Timestamp
9        0 us
10       10 us   (+10)
12       30 us   (+20)
13       40 us   (+10)
15       100 us  (+60)
16       110 us  (+10)
RapiTime: how it works
● RapiTime: Software tool that computes the longest time a piece of software (process) can take to run – worst-case execution time: WCET
– on-target profiler for the worst case
● Procedure:
  1. Run tests on the target system
     – Use existing test cases if possible
  2. Produce a trace of execution
     – Captured by logic analyser, ETM, etc.
  3. RapiTime builds a report
RapiTime Analysis Process
[Diagram: standard build process — Sources → compile/link → Executable → run (Tests). RapiTime adds: instrument the sources, extract the program structure, and collect a trace of each run (Traces) → WCET Analysis → WCET reports.]

Supported compilers:
– Ada: GNAT, GreenHills Ada, Aonix
– C: GCC, Tasking, dcc, Cosmic, GreenHills, IAR
Tracing Mechanism
● Four general architectures to produce traces:1. Software only (low cost, high overhead)
– Instrumentation points write to memory buffer
– Dump buffer via debugger or serial etc.
2. Software with hardware support
– Instrumentation points write to IO port (etc)
– Logic Analyser or TraceBox timestamps data
3. Full hardware (transparent)
– In-circuit emulator, ETM or Nexus trace etc
– Debugger provides timestamped trace (branch points)
4. Simulated (high cost, low overhead, transparent)
– Software simulator provides a full program counter trace
● Some options best with instrumentation, some without
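Option 1 (software-only tracing to a memory buffer) can be sketched as below. This is an illustrative implementation, not RapiTime's actual one: the buffer layout and read_cycle_counter() (here a stub; on real hardware it would read a free-running timer) are assumptions.

```c
#include <stdint.h>

#define TRACE_CAPACITY 4096

/* Stub timestamp source for the sketch -- on target this would read a
 * free-running hardware timer or cycle counter. */
static uint32_t fake_cycles;
static uint32_t read_cycle_counter(void) { return fake_cycles += 10; }

typedef struct {
    uint16_t ipoint;     /* instrumentation point id */
    uint32_t timestamp;  /* cycle count at the ipoint */
} trace_entry_t;

trace_entry_t trace_buf[TRACE_CAPACITY];
unsigned      trace_len;

/* Each instrumentation point appends an (id, timestamp) pair to the RAM
 * buffer, which is later dumped via the debugger or a serial port. */
void RPT_Ipoint(uint16_t id)
{
    if (trace_len < TRACE_CAPACITY) {
        trace_buf[trace_len].ipoint    = id;
        trace_buf[trace_len].timestamp = read_cycle_counter();
        trace_len++;
    }
}
```

The trade-off named on the slide is visible here: no external hardware is needed, but every ipoint costs a timer read and a store (the "high overhead" of the software-only option).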
Instrumentation example
● Instrumentation is automatic
● Configurable:
  – Lots of instrumentation: good report
  – Little instrumentation: less detailed report, but lower overhead
if (a) {
    c = b();
}
return f(c+d);

if (a) {
    RPT_Ipoint(15);
    c = b();
}
{
    int tmp = f(c+d);
    RPT_Ipoint(16);
    return tmp;
}
Example tracing
• Write a number to a memory location
  – Down to 1 or 2 assembler instructions
• External hardware does the tracing, for example:
  – Debugger:
    • Lauterbach Trace32, iSYSTEM, American Arium, ...
  – Logic analyser:
    • Tektronix, Agilent
  – Specialised TraceBox
    • Continuous tracing

/* Write to address */
#define RPT_Ipoint(I) \
    (void)((*(volatile unsigned short *)PORT_ADDR) = (I))
RapiTime Report
Example: profile
Profile of a small section of code
Experience Report: BAE Systems Hawk Jet Trainer
• Analysis of the Mission Computer
• Rack of several MPC7410s, at 500 MHz
• Written in Ada, 25 SW partitions
• Several 100K LOC
• Key objective:
  – Optimise overall execution by 10%
  – 4 partitions analysed (25% of the schedule)
• WCET hotspots
  – RapiTime identified key WCET hotspots and provided quantification of optimisation opportunities
• Main optimisations done:
  – Removing code that created redundant copies of large data structures
  – Rewriting bit-unpacking code
  – Enabling the compiler to generate much more efficient code (called 700 times)
  – Rewriting case statements using look-up tables (called 450 times)
• Results:
  – 23% reduction in WCET
  – By only examining 1.2% of the code
5. Optimization
Optimization
• Optimization is a compromise between:
  – Time
  – Space
  – Maintainability
  – Effort
• At what level can you optimize?
  – Instruction level (compiler)
  – Function level (algorithms)
  – Design level (architecture)
Why you should not “optimize”!
• Maintainability:
  – Optimised code can be hard to read and maintain
  – Was the loss in maintainability worth it?
• No improvement:
  – Are you sure that your code is faster?
  – On the worst-case path?
• Compiler optimizations:
  – Compilers often generate faster code if you don't try to be "clever" when writing it
  – But they can generate very nasty code from "innocent" source
• hard to predict by inspection of code
Why you have to optimize
• Because you have to improve the code performance....
How to Optimize
• Know where to optimize
  – High-level first: why do you call that function so many times?
  – Low-level: reduce the execution time of that function
• Know how much time you could save
  – Prioritisation
• Know how much time you actually save
  – The worst-case path may switch!
  – Evidence
  – Roll back optimizations if the benefit is not worth the loss in maintainable code
• Understand the worst-case behaviour
• Understand compiler and target-specific optimizations
• Target: Optimize for maximum improvement with minimum changes.
Compiler-related Hotspots
• Today's compilers are powerful tools– path from high-level language to machine code is complex
• Compilers can do good optimizations
  – Compilers know about high-level language features; don't try to use low-level hacks for optimization
    • e.g. a[i] is often faster than *(a+i)
– Using appropriate compiler switches can make significant difference to execution time, without modifying code.
• Compilers can add heavyweight "hidden" code that isn't in the source
  – Copying data
  – Setting up argument lists
• Analysis of the program, running on the target, is the best way to find hidden execution time.
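One common source of hidden copying is passing a large struct by value: the compiler silently emits a copy of the whole struct at every call site. The sizes and names below are illustrative.

```c
/* Hidden code the compiler adds: passing a large struct by value makes
 * the compiler copy the whole struct at every call; passing a pointer
 * copies only the pointer.  frame_t and its size are made up. */
typedef struct { int samples[256]; } frame_t;   /* ~1 KB */

int sum_by_value(frame_t f)            /* hidden ~1 KB copy per call */
{
    int s = 0;
    for (int i = 0; i < 256; i++)
        s += f.samples[i];
    return s;
}

int sum_by_pointer(const frame_t *f)   /* copies only a pointer */
{
    int s = 0;
    for (int i = 0; i < 256; i++)
        s += f->samples[i];
    return s;
}
```

Both functions compute the same result; only on-target analysis (or inspecting the generated code) reveals the extra time the by-value copy costs on each call.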
Examples of low-level optimization
• Cache locking
  – Lock code on the WCET path, not necessarily the most-used path
• Data widths and alignments
  – Using non-native sizes adds code size/time
• Signed/unsigned operations
• Multiply and divide can be slow on some architectures
• Inline functions
• Compiler pragmas/options
• Know the processor:
  – Optimizations for one processor may result in long execution times on another
Function Level Optimization
• Rewriting functions using techniques such as:
  – Loop unrolling
  – Avoiding extra copies
  – Removing unnecessary checking
  – Lookup tables
  – Data caching
  – Iteration/traversal of data structures
  – Loop jamming (joining loops)
  – Many optimizations are based on moving to single-path code
  – Optimal algorithm complexity: use the right algorithm for the task!
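The lookup-table technique from the list above (also used in the Hawk case study to rewrite case statements) can be sketched as follows. The modes and gain values are invented for the example; the point is that the table version replaces a chain of data-dependent compares with one bounded memory access.

```c
/* Lookup table instead of a case statement: a switch may compile to a
 * compare chain with data-dependent timing; a table lookup is one
 * bounds check plus one memory access.  Modes/values are illustrative. */
enum { MODE_IDLE, MODE_CRUISE, MODE_CLIMB, MODE_DESCENT, MODE_COUNT };

int gain_switch(int mode)
{
    switch (mode) {
    case MODE_IDLE:    return 1;
    case MODE_CRUISE:  return 4;
    case MODE_CLIMB:   return 8;
    case MODE_DESCENT: return 6;
    default:           return 0;
    }
}

static const int gain_table[MODE_COUNT] = { 1, 4, 8, 6 };

int gain_lookup(int mode)   /* same mapping, near-constant time */
{
    return (mode >= 0 && mode < MODE_COUNT) ? gain_table[mode] : 0;
}
```

Whether this actually wins depends on the compiler and target (some compilers already generate jump tables for dense switches), so the improvement should be confirmed by measurement.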
Single Path Code
• We are interested in:
  – Reducing the worst-case execution time
  – Reducing the variability of execution time
• Example: loop unrolling
  – Unroll the loop to its maximum number of iterations
  – "Average" execution time may increase because we never exit the loop early
  – The worst case decreases because there are no tests or branches
• However, you must consider:
  – Is the change really an optimization on the specific target?
  – Is the optimization worth the extra code overhead and loss of readability?
• Use tool support to quantify the impact on timing.
Loop unroll to single path
The loop can execute a maximum of 4 times:

for (i=0; i<length; i++){
    dest[i]=src[i];
}

Unroll to 4 copies anyway; the WCET can be shorter because there are fewer tests and branches:

dest[0]=src[0];
dest[1]=src[1];
dest[2]=src[2];
dest[3]=src[3];
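The fully unrolled version above assumes length is always 4. When shorter lengths are possible, a guarded unroll (a variant sketched here, not from the slides) keeps correctness while still removing the loop-control overhead; note it is no longer strictly single-path, since the guards remain.

```c
/* Guarded unroll for a copy of at most 4 elements.  Correct for any
 * 0 <= length <= 4; strictly single-path only when length is always 4. */
void copy4(int *dest, const int *src, int length)
{
    if (length > 0) dest[0] = src[0];
    if (length > 1) dest[1] = src[1];
    if (length > 2) dest[2] = src[2];
    if (length > 3) dest[3] = src[3];
}
```

As the slide warns, whether either form is actually faster than the loop depends on the target and compiler, so the change should be confirmed by measurement on the worst-case path.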
Design Optimizations
• High-level optimizations can give the most benefit (but perhaps at the highest cost)
• Communication model
  – Synchronization, scheduling
  – Data flows
• Software reuse:
  – Generic code may have more functionality than you need
  – Unnecessary levels of abstraction
Tool support
• Timing and optimization are not guesswork (any more)
  – Avoid the manual work of instrumenting/measuring
  – Powerful analysis and reporting
  – In a large project you need a floodlight, not a candle
• Tool support tells you:
  – Where the worst-case hotspots are
  – How much you could save
  – How much you actually save (or lose!)
Conclusions
• Determining WCET is essential for proving timing properties
• WCET is an "upper bound"
  – Tasks have variability in execution times
• Various approaches for calculating WCET, with pros and cons:
– End-to-end testing– Purely static– Hybrid approach
• In any case, detailed knowledge of the application AND the target is essential
Measurement based WCET Analysis
• Measurement-based WCET analysis:
  – Avoids low-level modeling
  – Measures small components of the system
  – Puts these pieces together using high-level static analysis techniques
  – Combines the best features of both worlds
• Testing
  – But relies on very good test data (garbage in, garbage out)
  – In any case it is "better" than end-to-end testing
• Process
  – Code instrumentation
  – Running the program under a "good" test environment
  – Collection of trace data
  – Trace data as input to the high-level analysis