Copyright © 2005

EEM202A/CSM213A - Fall 2005

Ram Kumar & Roy Shea

UCLA - NESL

{ram@ee,roy@cs}.ucla.edu

http://nesl.ee.ucla.edu

Lecture #12: Reliable Embedded Software

Reading List for this Lecture

• “The Model Checker Spin”, IEEE Trans. on Software Engineering, Vol. May 1997.
• D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, and D. Culler. “The nesC Language: A Holistic Approach to Networked Embedded Systems”. Proceedings of Programming Language Design and Implementation (PLDI) 2003, June 2003.
• G. Necula, S. McPeak, S.P. Rahul, and W. Weimer. “CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs”. Proceedings of Conference on Compiler Construction, 2002.
• Mark Weiser. “Program Slicing”. 5th International Conference on Software Engineering. 1981.
• Robert Wahbe, Steven Lucco, Thomas E. Anderson, and Susan L. Graham. “Efficient Software-Based Fault Isolation”. Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles (SOSP-93).
  – http://citeseer.ist.psu.edu/wahbe93efficient.html
• Niall Murphy. “Watchdog Timers”. Embedded Systems Programming.
  – http://www.embedded.com/2000/0011/0011feat4.htm

Outline

• Overview of design process
• Static analysis
  – Concurrency
  – Memory usage
• Runtime monitoring
  – Detection
  – Isolation
• Hardware support
• Conclusions

[Figure: design process stages: Specification, Implementation (static analysis), Testing, Deployment (runtime monitoring)]

Overview of Software Design Process

• Specification
  – Understand task and constraints
  – Develop formal models for protocols
  – “The Model Checker Spin”, IEEE Trans. on Software Engineering, Vol. May 1997.
• Testing
  – Feed inputs
  – Stress test
  – Long test
• Implementation*
  – Coding standards
  – Code reviews and pair programming
  – Static analysis
• Deployment*
  – Fault detection
  – Isolation
  – Feedback

What and Why of Static Analysis

• “Testing and verification of a system without running the code”
• Specification may not be implemented correctly
• Not all errors appear during test runs
  – Concurrency problems with timing dependence
  – Faults under specific system loads
• Complements other techniques
• Early detection such as type checking

Techniques

• Create abstract model of the program
  – Direct reasoning about code is hard
  – Basic blocks or AST
  – G. Necula, S. McPeak, S.P. Rahul, and W. Weimer. “CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs”. Proceedings of Conference on Compiler Construction, 2002.
• Examine the model
  – Mark Weiser. “Program Slicing”. 5th International Conference on Software Engineering. 1981.
  – Dataflow to track state through a program

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int x;
    int y;

    x = rand() % 10;
    y = rand() % 9;

    if (x > y) {
        x = x * x / 2;
    } else {
        x = y / 2;
        y = y * x;
    }

    printf("X+Y = %d\n", x + y);
    return 0;
}
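As a rough illustration of slicing on the program above, the sketch below compares the full computation with its backward slice on y at the printf. It is only a sketch: the two rand() results are passed in as parameters so the code is deterministic, and the function names full_y, sliced_y, and slice_matches are hypothetical. Only the then-branch assignment to x drops out of the slice, because y depends on x through both the branch condition and x = y / 2.

```c
#include <assert.h>

/* Full computation from the slide, with the rand() results passed in
 * as x0 and y0 (an assumption, to make the example deterministic). */
int full_y(int x0, int y0)
{
    int x = x0;
    int y = y0;

    if (x > y) {
        x = x * x / 2;
    } else {
        x = y / 2;
        y = y * x;
    }
    return y;
}

/* Backward slice on y at the printf: keep only the statements that
 * can affect the final value of y. */
int sliced_y(int x0, int y0)
{
    int x = x0;
    int y = y0;

    if (x > y) {
        /* x = x * x / 2 removed: it cannot affect y */
    } else {
        x = y / 2;
        y = y * x;
    }
    return y;
}

/* Returns 1 if the slice agrees with the full program on every input
 * the slide's program can produce (x in 0..9, y in 0..8). */
int slice_matches(void)
{
    for (int x0 = 0; x0 < 10; x0++)
        for (int y0 = 0; y0 < 9; y0++)
            if (full_y(x0, y0) != sliced_y(x0, y0))
                return 0;
    return 1;
}
```

Calling slice_matches() checks all 90 input pairs; a slice on x + y instead would have to keep every statement.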

Example: Concurrency

• Problem
  – Shared data can be corrupted by concurrent accesses
  – Concurrency is a problem even without threading (why?)
• Solution
  – Annotate atomic code blocks
  – Infer what must be protected
  – Verify protection by looking at code base
• D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, and D. Culler. “The nesC Language: A Holistic Approach to Networked Embedded Systems”. Proceedings of Programming Language Design and Implementation (PLDI) 2003, June 2003.
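One way to see why concurrency matters even without threads is an interrupt handler that preempts a read-modify-write of shared data. The sketch below simulates this on a host machine. Everything in it is an assumption made for illustration: fake_isr, irq_enabled, interrupt_arrives, and the atomic_begin/atomic_end helpers are hypothetical stand-ins for a real interrupt and for the interrupt masking that an atomic block might compile to; they are not nesC or TinyOS APIs.

```c
#include <assert.h>

int shared_count;        /* data shared between task code and the ISR */
int irq_enabled = 1;     /* simulated global interrupt-enable flag */
int irq_pending;         /* simulated interrupt raised while masked */

/* Simulated interrupt handler: also updates the shared counter. */
void fake_isr(void)
{
    shared_count += 10;
}

/* Simulated hardware: an interrupt arrives at this point. It runs
 * immediately if interrupts are enabled, otherwise it stays pending. */
void interrupt_arrives(void)
{
    if (irq_enabled)
        fake_isr();
    else
        irq_pending = 1;
}

void atomic_begin(void)
{
    irq_enabled = 0;
}

void atomic_end(void)
{
    irq_enabled = 1;
    if (irq_pending) {       /* deliver the deferred interrupt */
        irq_pending = 0;
        fake_isr();
    }
}

/* Unprotected read-modify-write: the ISR's update is lost. */
int increment_unprotected(void)
{
    shared_count = 0;
    int tmp = shared_count;  /* read */
    interrupt_arrives();     /* ISR preempts between read and write */
    shared_count = tmp + 1;  /* write overwrites the ISR's update */
    return shared_count;
}

/* Same update inside an atomic section: both updates survive. */
int increment_atomic(void)
{
    shared_count = 0;
    atomic_begin();
    int tmp = shared_count;
    interrupt_arrives();     /* deferred until atomic_end() */
    shared_count = tmp + 1;
    atomic_end();
    return shared_count;
}
```

The unprotected version ends with a count of 1 (the handler's +10 is overwritten), while the atomic version ends with 11.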

Example: Memory Management

• Problem
  – Dynamic memory in embedded applications can result in difficult-to-understand bugs and strange errors
  – Dangling pointers, memory leaks, data corruption
• Important benefits of dynamic memory
  – Significantly simplify code base
  – Dynamic Memory Allocation in Embedded Apps?
  – http://ask.slashdot.org/article.pl?sid=05/11/16/2236235&tid=156&tid=201&tid=4

int *p = malloc(sizeof(int) * num);
int *q = malloc(sizeof(int) * num * 2);
int *r = p;              /* r aliases p */
...
free(r);                 /* frees the block that p still points to */
...
if (p[0] == 0)           /* use after free: p is now dangling */
    launchMissile();

Model for Memory

Formalized by Shane Markstrum

Implementation

• Convert module into an AST
• Use data flow to track annotated data
  – __attribute__((sos_claim))
  – __attribute__((sos_release))
• Must either:
  – persistently store data once
  – free data
  – release data to ownership of another module
• Must not create any persistent references to data before call
• Must treat data as dead after the call

[Diagram labels: caller, callee]
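A minimal sketch of how such ownership annotations might look in module code. The SOS_CLAIM and SOS_RELEASE macros are hypothetical and expand to nothing here so that the example compiles with any C compiler; in the checker they would presumably expand to the two attributes. The function names, and the reading of which side carries which obligation, are also assumptions rather than something the slide states.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical annotation macros; they expand to nothing here. */
#define SOS_CLAIM    /* caller takes ownership of the returned data */
#define SOS_RELEASE  /* caller gives up ownership of the argument   */

static int *stored;  /* the one persistent reference this module keeps */

/* Hands a freshly allocated buffer to the caller, who now owns it. */
SOS_CLAIM int *buffer_alloc(size_t n)
{
    return calloc(n, sizeof(int));
}

/* Takes ownership of buf and stores it persistently exactly once. */
void buffer_store(SOS_RELEASE int *buf)
{
    free(stored);        /* drop any previously stored buffer */
    stored = buf;
}

/* A well-behaved owner: claims data, uses it, then releases it. */
int producer(void)
{
    int *buf = buffer_alloc(4);
    if (buf == NULL)
        return -1;
    buf[0] = 42;         /* no persistent reference is created */
    buffer_store(buf);   /* ownership moves to the storing module */
    buf = NULL;          /* treat the data as dead after the call */
    return stored[0];
}
```

A dataflow checker would flag a producer that kept using buf after buffer_store(), or that returned without storing, freeing, or releasing it.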

Outline

• Run Time Techniques
  – Operate during the execution of system
  – Access to more information than the static analysis tools
  – Introduce performance overheads
• Fault Isolation
  – Localize the impact of the fault
  – Specifically looking at memory corruption faults
• Fault Tolerance
  – Detect and recover from a fault
    • Restore to a known good state
    • Re-initialize the state
  – Specifically looking at hardware/architecture based techniques

Memory Corruption Fault Within Single Address Space

• A program is free to access the entire address space
• Memory Corruption Fault
  – Very easy for a program to corrupt the state of other programs
• Desktop/Server CPUs have MMU
  – No MMU in Embedded Processors (esp. micro-controllers)
  – Power, performance, and cost constraints rule it out

[Figure: program memory (applications, middleware, operating system) and data memory (run-time stack, global data and heap), each a single continuous address space]

Software Fault Isolation (SFI)

• Re-write the program to perform fault isolation in software
  – Simple but a very powerful concept
  – Useful even in servers/desktops for high performance application extensions, kernel extensions etc.
• Trade slower instrumented code for more protection
  – No need for a hardware protection boundary
• Slogan - You can still shoot yourself in the foot, but you can’t shoot the other guy in the foot

Ack. Prof. Aiken UCB

Overview

• Maintain two invariants for isolated code
  – Any jumps stay within the isolated code
  – Any writes are to data belonging to the isolated code
• Idea: Divide the address-space into segments
  – All addresses in a segment share a unique pattern of high-order bits
• Protection subdomains are defined by segments
  – Every write must be within the segment
  – Every jump must be within the segment

Fault Domain

[Figure: PROG and DATA memory divided into fault domains; labels: Run-time Stack, Sampling Application, Middleware, Operating System]

• No jumps outside fault domain
• No writes outside fault domain

Implementation - Segment Matching

• Replace each store by the sequence:

dedicated-reg ← target address
    Move target address into dedicated register
scratch-reg ← (dedicated-reg >> shift-reg)
    Right shift address to get segment identifier
    Shift-reg is dedicated
scratch-reg == segment-reg
    Compare segment identifier with current segment
    Segment-reg is dedicated
trap if not equal
    Trap if store address is outside of the segment
store through dedicated-reg
    Guaranteed to store at the correct address
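The sequence above can be sketched in C as a checked store over a simulated memory. This is an illustration under stated assumptions, not the instrumentation a real SFI tool emits: the 16-bit address width, the 8-bit segment offset (SEGMENT_SHIFT), the memory array, and the function name checked_store are all hypothetical, and the trap is modeled as a -1 return value.

```c
#include <assert.h>
#include <stdint.h>

#define SEGMENT_SHIFT 8           /* assumed: low 8 bits are the offset */

uint8_t memory[1u << 16];         /* simulated 64 KB data memory */

/* Checked store: returns 0 on success, -1 where real SFI would trap. */
int checked_store(uint16_t target, uint8_t value, uint16_t segment_id)
{
    uint16_t dedicated = target;                          /* dedicated-reg */
    uint16_t scratch = (uint16_t)(dedicated >> SEGMENT_SHIFT);

    if (scratch != segment_id)
        return -1;                /* trap: store is outside the segment */
    memory[dedicated] = value;    /* store through dedicated-reg */
    return 0;
}
```

For a domain that owns segment 0x02, a store to 0x0210 succeeds, while a store to 0x0310 is rejected and leaves memory untouched.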

Comments

• Segment matching overhead
  – 4 instructions for EVERY store instruction in the program
• Requires three dedicated registers
  – Dedicated-reg holds the address being computed
  – Segment-reg holds current valid segment
  – Shift-reg holds the size of the shift to perform
  – These three registers must not be used elsewhere in the program
• Why dedicated registers?
  – What will happen if a jump instruction by-passes all checks?
  – What will happen if a jump lands in the middle of the checks?

Sandboxing - Faster Approach

• Idea
  – Don’t test the segment bits
  – Just overwrite segment bits with correct segment

dedicated-reg ← (target-reg & and-mask-reg)
    Use dedicated register and-mask-reg to clear segment identifier bits
dedicated-reg ← dedicated-reg | segment-reg
    Use dedicated register segment-reg to set segment identifier bits

• This is much faster
  – Only two instructions per instrumentation point
• Loses information about errors
  – Program may keep running with incorrect instructions and data
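A small C sketch of the two sandboxing instructions, again over a simulated memory. The 16-bit addresses, the mask and segment constants, and the name sandboxed_store are assumptions chosen for illustration. The function returns the address actually written, which makes the loss of error information visible: an out-of-segment store silently lands inside the segment.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_MASK  0x00FFu      /* assumed: low 8 bits are the offset */
#define SEGMENT_BITS 0x0200u      /* assumed: this domain owns segment 0x02 */

uint8_t memory[1u << 16];         /* simulated 64 KB data memory */

/* Sandboxed store: force the address into the segment, then store.
 * Returns the address that was actually written. */
uint16_t sandboxed_store(uint16_t target, uint8_t value)
{
    uint16_t dedicated = (uint16_t)(target & OFFSET_MASK);  /* clear segment bits */
    dedicated = (uint16_t)(dedicated | SEGMENT_BITS);       /* set segment bits */
    memory[dedicated] = value;
    return dedicated;
}
```

Here a stray store aimed at 0x0310 is redirected to 0x0210: the other segment is protected, but the faulty program overwrites its own data and keeps running.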

Implementation Details

• Optimizations
  – Traditional compiler optimizations
    • Move sandboxing out of the loop
  – Don’t instrument statically verifiable writes and jumps
• Binary instrumentation
  – Most portable & easily deployed
  – Also the hairiest option
  – Need to verify the binary
    • No use of dedicated registers
• Modified compiler
  – Less easy to adopt
  – But easier to implement

Things to ponder about …

• How will the applications residing in their respective fault domains communicate with one another ?

• How will the data be shared amongst the fault domains?

• How will SFI be implemented on micro-controllers with less than 1 KB of memory ?

Embedded Systems In Real World

• Used in inaccessible places
  – Controllers for space vehicles - Mars Pathfinder
  – Closer to home … sensor networks in dense forests
• Used for critical applications
  – Brake-by-wire systems
  – Medical Instruments
• Unexpected faults
  – Cosmic rays may flip on-chip bits
• Hard or even impossible to produce perfect firmware
  – Strive to design our systems to cleanly handle failures

Watchdog Timer Hardware

• Hardware counter that is set to an initial value
• Continually counts down to zero
• Responsibility of the software to reset the count to its original value
• When the counter reaches zero, the software is assumed to have failed
• Perform any suitable recovery
  – Typically reset the CPU
• Visual Metaphor
  – “If the man stops kicking the dog, the dog will take advantage of the hesitation and bite the man.”
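The counter behaviour described above can be simulated on a host machine. This is a sketch under assumptions: the 3-tick timeout, the variable and function names, and the use of a counter increment in place of a real CPU reset are all hypothetical; real watchdogs are configured through device-specific registers.

```c
#include <assert.h>

#define WDT_RELOAD 3              /* assumed timeout, in ticks */

int wdt_count = WDT_RELOAD;       /* simulated hardware down-counter */
int reset_count;                  /* resets triggered by the watchdog */

/* Software side: "kick the dog" by restoring the original count. */
void wdt_kick(void)
{
    wdt_count = WDT_RELOAD;
}

/* Hardware side: called once per timer tick. */
void wdt_tick(void)
{
    if (--wdt_count == 0) {
        reset_count++;            /* stands in for a CPU reset */
        wdt_count = WDT_RELOAD;   /* counter restarts after the reset */
    }
}

/* Run the system for `ticks` ticks. The software kicks the watchdog
 * only while `healthy` is nonzero. Returns the resets seen so far. */
int run(int ticks, int healthy)
{
    for (int i = 0; i < ticks; i++) {
        if (healthy)
            wdt_kick();
        wdt_tick();
    }
    return reset_count;
}
```

A healthy run never lets the counter reach zero; once the software stops kicking, the simulated watchdog bites every three ticks.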

Failures detected by watchdog

• Catch events that hang the system
• Transient Failures
  – Power glitches may corrupt program counter, stack pointer or even the data in RAM
• Software Bugs
  – Infinite loops
  – Accidental jump out of code memory
  – Deadlock conditions (Incorrect design)
• Watchdog guarantees that none of the bugs will hang the system indefinitely

Watchdog Design Considerations

• First Aid - Recovery from watchdog bite
• Maintain a count of number of resets
  – Shutdown a persistently errant application
• Use watchdog for sanity checks
  – Verify the control flow through a piece of code
  – Record failure reports in non-volatile storage
  – Diagnostic information is very useful
• Choosing watchdog timeout interval
  – Need to understand the timing characteristics of the program
  – Large interval - Slow response
  – Small interval - Frequent resets, difficult to diagnose
• Space Shuttle’s main engine controller
  – WDT timeout 18 ms
  – Switchover to a backup computer
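The reset-count and failure-report ideas above can be sketched as a small boot-time policy. This is a host-side simulation under assumptions: the nv_log structure stands in for real non-volatile storage (EEPROM or flash), and the threshold of three resets, the checkpoint identifiers, and all function names are hypothetical.

```c
#include <assert.h>

#define MAX_WDT_RESETS 3          /* assumed policy threshold */

/* Simulated non-volatile storage; on real hardware this would live in
 * EEPROM or flash so that it survives a reset. */
struct nv_log {
    int wdt_resets;               /* consecutive watchdog resets */
    int last_checkpoint;          /* last control-flow checkpoint reached */
};

struct nv_log nv;

/* Called by the application at interesting points in its control flow,
 * so a later failure report shows how far execution got. */
void checkpoint(int id)
{
    nv.last_checkpoint = id;
}

/* Boot-time policy. `wdt_reset` is nonzero if the watchdog caused this
 * boot. Returns 1 if the application may start, 0 if it should stay
 * shut down as persistently errant. */
int boot_policy(int wdt_reset)
{
    if (!wdt_reset) {
        nv.wdt_resets = 0;        /* a clean boot clears the history */
        return 1;
    }
    nv.wdt_resets++;
    return nv.wdt_resets < MAX_WDT_RESETS;
}
```

With this policy the application is restarted after the first two watchdog bites and held off after the third, while the last recorded checkpoint remains available for diagnosis.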

Watchdog Self Test

• What if WDT fails in a way that it never bites?
• Would be discovered only if a failure hangs the system
• WDT failure can happen very easily
  – WDT can be disabled in software
  – HW misconfiguration - Jumper on the reset line pulled out
• Startup self-test
  – Allow WDT to time out and reset the processor
  – Flag to distinguish power-on reset from WDT reset
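The startup self-test can be sketched as a boot routine that deliberately waits for one watchdog bite after power-on. This is a simulation under assumptions: the reset cause is passed in as a parameter (real hardware would report it in a status register), self_test_pending is assumed to live in memory that survives a watchdog reset, and all names are hypothetical.

```c
#include <assert.h>

enum reset_cause { POWER_ON_RESET, WDT_RESET };
enum boot_action { WAIT_FOR_WDT_BITE, START_APPLICATION, RECOVER_FROM_FAULT };

int self_test_pending;    /* assumed to survive a watchdog reset */

/* Decide what to do at boot, given what caused the reset. */
enum boot_action boot(enum reset_cause cause)
{
    if (cause == POWER_ON_RESET) {
        self_test_pending = 1;
        return WAIT_FOR_WDT_BITE;   /* stop kicking; expect a WDT reset */
    }
    if (self_test_pending) {
        self_test_pending = 0;      /* the watchdog bit: self-test passed */
        return START_APPLICATION;
    }
    return RECOVER_FROM_FAULT;      /* unexpected bite during normal operation */
}
```

If the watchdog never bites during WAIT_FOR_WDT_BITE, the system never starts the application, which makes a dead watchdog visible at power-up instead of during a later hang.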

Grenade Timer

• Idea - Build a counter that cannot be reloaded once it is running
  – Grenade whose pin has been pulled will have to explode
• Guaranteed reboot is a “useful feature” in some applications
  – Purges all bad state and re-initializes the system

Grenade Timer HW Interface
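The key property, a countdown that software cannot reload once started, can be sketched as follows. This is a host-side model, not the hardware interface from the figure: the structure layout, the function names, and the return-value conventions are assumptions.

```c
#include <assert.h>

struct grenade_timer {
    int armed;        /* set once the pin has been pulled */
    int remaining;    /* ticks until the forced reboot */
};

/* Start the countdown. Returns 0 on success, -1 if the timer is
 * already running: a running grenade timer cannot be reloaded. */
int grenade_start(struct grenade_timer *t, int ticks)
{
    if (t->armed || ticks <= 0)
        return -1;
    t->armed = 1;
    t->remaining = ticks;
    return 0;
}

/* Advance one tick. Returns 1 when the timer expires (forced reboot),
 * otherwise 0. */
int grenade_tick(struct grenade_timer *t)
{
    if (!t->armed)
        return 0;
    if (--t->remaining > 0)
        return 0;
    t->armed = 0;     /* the reboot clears the timer state */
    return 1;
}
```

Unlike a watchdog, there is no kick operation: an attempt to extend a running countdown is refused, so the reboot is guaranteed.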

Taxonomy

• FAILURE
  – Event that occurs when the delivered service deviates from the correct service
  – Failure is the effect that is observed
  – E.g. - “Your iPod Nano stops responding.”
• FAULT
  – Fault is the cause of an error
  – An error may lead to failure
  – E.g. - “Memory corruption fault led to the failure of the iPod”
