Jordan Adamek Mikhail Nesterenko Sébastien Tixeuil

Evaluating Practical Tolerance Properties of Stabilizing Programs Through Simulation: The Case of Propagation of Information with Feedback

Jordan Adamek

Mikhail Nesterenko

Sébastien Tixeuil

Symposium on Stabilization, Safety, and Security of Distributed Systems

Toronto, Canada, October 2012


Why Simulate Stabilization

• A stabilizing program has to recover from an arbitrary system state.
  To prove the algorithm correct, the designer has to focus on stabilization from degenerate states that are rarely reached in practice.
  Such an exercise tells little about the algorithm’s practical performance.

• Performance evaluations in the area of stabilization are relatively rare. However, they present unique challenges.

  What to consider?
  states: randomization is a common answer. Yet uniformly randomized states may be “mild”: they distribute process states evenly and may not represent systemic faults.
  execution models: the model needs to be realistic, yet the results should pertain to the algorithm and not be an artifact of the model.
  parameters: stabilization time is common, yet it often hides the complexity of failure recovery. Other parameters need to be considered.

• We simulate stabilizing PIF and analyze its performance using realistic initial states and three classic execution models, and we compute a number of stabilization parameters.


Outline

• PIF algorithm
• parameter selection and experiment setup
• results
• analysis
• conclusion


PIF Algorithm

propagation of information with feedback (PIF)
• used to deliver information on rooted trees from root to leaves and get an ack
• often considered in stabilization literature; proven ideally- and self-stabilizing [8,9] as well as snap-stabilizing [1]

description
• each process can be in one of three states: idle (i), requesting (rq), replying (rp)
• root initiates a wave by switching from idle to requesting
• each intermediate process p propagates request to its children Ch.p
• each leaf reflects the wave back by switching from idle to replying
• intermediate processes propagate reply back to root
• root waits for reply from all children and repeats the cycle
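The wave described above can be sketched as a small fault-free simulation. The guards below are an assumption chosen to match the informal description on this slide; they are not the exact rules of the PIF variants in the cited literature, and they are not claimed to be stabilizing. The names `enabled_move` and `run_wave`, the `parent` map encoding of the tree, and the one-action-per-step scheduling are all illustrative.

```python
import random

IDLE, RQ, RP = "i", "rq", "rp"

def enabled_move(p, state, parent, children):
    """Return the next state of p if an (assumed) guard holds, else None."""
    kids = [state[c] for c in children[p]]
    s = state[p]
    if parent[p] is None:                        # root
        if s == IDLE and all(k == IDLE for k in kids):
            return RQ                            # initiate a new wave
        if s == RQ and all(k == RP for k in kids):
            return IDLE                          # all replies received
        return None
    par = state[parent[p]]
    if s == IDLE and par == RQ and all(k == IDLE for k in kids):
        return RQ if kids else RP                # forward request; a leaf reflects it
    if s == RQ and all(k == RP for k in kids):
        return RP                                # propagate reply toward the root
    if s == RP and par != RQ:
        return IDLE                              # clean up once the parent has moved on
    return None

def run_wave(parent, seed=0):
    """Run from the all-idle state until the root completes one wave; return step count."""
    rng = random.Random(seed)
    children = {p: [] for p in parent}
    for p, q in parent.items():
        if q is not None:
            children[q].append(p)
    state = {p: IDLE for p in parent}
    steps = 0
    while True:
        moves = {p: m for p in parent
                 if (m := enabled_move(p, state, parent, children)) is not None}
        p = rng.choice(sorted(moves))            # execute one enabled action
        state[p] = moves[p]
        steps += 1
        if parent[p] is None and state[p] == IDLE:
            return steps                         # root is back to idle

# A root with two leaf children: request, two replies, completion.
print(run_wave({0: None, 1: 0, 2: 0}))
```

Under these assumed guards, a root with two leaf children completes its wave in four steps regardless of the random choices, since the two leaves are the only processes that can move between the root's two actions.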


Initial State Selection

tree selection
• problem: how to select trees that do not favor a particular topology or shape
• solution: Prüfer sequence: a sequence of n-2 labels uniquely defines one of all possible trees of n labels
  a random sequence chooses a labeled tree with equal probability
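As a minimal sketch of this tree selection, the standard Prüfer decoding below turns a sequence of n-2 labels into the n-1 edges of a labeled tree. Labels are assumed to be 0..n-1, and the function names `prufer_to_edges` and `random_tree` are illustrative. How the root is then chosen is not stated on the slide and is left out here.

```python
import heapq
import random

def prufer_to_edges(seq):
    """Decode a Prüfer sequence of length n-2 into the n-1 edges of a labeled tree."""
    n = len(seq) + 2
    degree = [1] * n
    for v in seq:
        degree[v] += 1                       # each occurrence adds one to the degree
    leaves = [v for v in range(n) if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in seq:
        leaf = heapq.heappop(leaves)         # smallest remaining leaf
        edges.append((leaf, v))
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)        # v has become a leaf
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges

def random_tree(n, rng):
    """Uniformly random labeled tree on n >= 2 nodes, as a list of edges."""
    return prufer_to_edges([rng.randrange(n) for _ in range(n - 2)])

print(prufer_to_edges([3, 3, 3, 4]))
# → [(0, 3), (1, 3), (2, 3), (3, 4), (4, 5)]
```

Because every sequence of n-2 labels corresponds to exactly one labeled tree, drawing each entry uniformly at random draws the tree uniformly as well.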

initial state – need to select an initial state, then perturb it by faults of varied extent
• problem: not all states occur with equal probability
  ex: root is seldom idle
• solution: start from the idle state, randomly pick a number from a range significantly larger than the system size, run the algorithm fault-free for that number of steps, then induce the fault
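The warm-up procedure above might look like the following sketch. The range of 10n to 20n steps is an assumption standing in for "significantly larger than system size", and `toy_step` is only a placeholder for one fault-free step of the algorithm under study (it is not PIF); `realistic_start` is a hypothetical name.

```python
import random

STATES = ("i", "rq", "rp")

def toy_step(state, rng):
    """Placeholder for one fault-free step: advance one random process (not PIF)."""
    p = rng.randrange(len(state))
    state[p] = STATES[(STATES.index(state[p]) + 1) % len(STATES)]

def realistic_start(n, step, rng, factor=10):
    """Start all-idle, then run a random number of fault-free steps.

    The number of steps is drawn from [factor*n, 2*factor*n), an assumed
    range that is large relative to the system size n.
    """
    state = [STATES[0]] * n
    warmup = rng.randrange(factor * n, 2 * factor * n)
    for _ in range(warmup):
        step(state, rng)
    return state, warmup

state, warmup = realistic_start(5, toy_step, random.Random(0))
print(warmup, state)
```

Faults would then be induced on the returned state, so that the perturbed configuration starts from a state the algorithm actually reaches in normal operation.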


Execution Models & Faults

execution models
• problem: the execution model should not appear to favor a particular system and/or architecture
• solution: selected 3 classic well-studied execution semantics
  interleaving – randomly execute one enabled action
  power-set – randomly pick the number X of actions to execute, randomly pick the first, exclude enabled neighbors; continue until X or all enabled actions are selected; execute selected actions
  synchronous – same as power-set, only continue randomly selecting actions until none remains

faults
• randomly pick a process and randomly select its state. Note: this may have no observable effect if the fault state is the same as the correct state
• all processes are faulty – arbitrary initial state: classic stabilization
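The three execution semantics and the fault model above can be sketched as two small helpers. The names `select_actions` and `inject_faults`, the `neighbors` adjacency map, and the choice to fault distinct processes are assumptions made for illustration; the slide does not say whether the same process can be hit twice.

```python
import random

def select_actions(enabled, neighbors, model, rng):
    """Pick the processes whose enabled action is executed in one step."""
    enabled = sorted(enabled)
    if not enabled:
        return []
    if model == "interleaving":
        return [rng.choice(enabled)]             # exactly one enabled action
    # power-set: random target size X; synchronous: select until none remains
    limit = rng.randint(1, len(enabled)) if model == "power-set" else len(enabled)
    candidates, chosen = set(enabled), []
    while candidates and len(chosen) < limit:
        p = rng.choice(sorted(candidates))
        chosen.append(p)
        candidates -= {p} | set(neighbors[p])    # exclude p and its enabled neighbors
    return chosen

def inject_faults(state, k, rng, values=("i", "rq", "rp")):
    """Overwrite k distinct random processes with random states (in place)."""
    for p in rng.sample(range(len(state)), k):
        state[p] = rng.choice(values)            # may equal the correct state
    return state

# Path 0 - 1 - 2 - 3 with every process enabled.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(select_actions({0, 1, 2, 3}, path, "synchronous", random.Random(0)))
```

With k equal to the number of processes, `inject_faults` produces an arbitrary initial state, which is the classic stabilization setting.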


Experiment Setup

• 100 processes
  avg. tree height 21.6 ± 4.9
  avg. number of leaves 37.5 ± 3.1
• faults varied from one to 100
• ran 1,000 experiments for each fault number
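As a sanity check on the reported leaf count, the sketch below uses the fact that a label is a leaf exactly when it does not appear in the Prüfer sequence, so the expected number of leaves of a uniform random labeled tree on n nodes is n(1 - 1/n)^(n-2). This assumes a leaf means any degree-one node and that trees are drawn uniformly via Prüfer sequences; the function names are illustrative.

```python
import random
import statistics

def expected_leaves(n):
    """Exact expected leaf count of a uniform random labeled tree on n nodes."""
    return n * (1 - 1 / n) ** (n - 2)

def sample_leaf_counts(n, trials, rng):
    """Count leaves as the labels absent from a random Prüfer sequence."""
    counts = []
    for _ in range(trials):
        used = {rng.randrange(n) for _ in range(n - 2)}
        counts.append(n - len(used))
    return counts

print(round(expected_leaves(100), 1))
# → 37.3
counts = sample_leaf_counts(100, 2000, random.Random(0))
print(statistics.mean(counts), statistics.stdev(counts))
```

For n = 100 the formula gives about 37.3, which is consistent with the reported average of roughly 37.5 leaves.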


Metrics

• stabilization time – number of execution steps for the algorithm to achieve a legitimate state (a single wave)
• number of actions until stabilization*
• overhead – number of action executions outside the propagation of the correct wave (wait time for interleaving semantics [1])
• longest causality chain* – longest chain of causally related actions; actions are causally related if executed on the same or a neighboring process
• scale – number of processes in the system

* metrics were not included in published proceedings
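One way to compute the longest causality chain from a recorded execution is sketched below. It assumes a chain is a sequence of actions from strictly increasing steps in which consecutive actions run on the same process or on neighboring processes; the trace format (one list of acting processes per step) and the name `longest_causality_chain` are illustrative.

```python
def longest_causality_chain(trace, neighbors):
    """Length of the longest chain of causally related actions.

    trace: list of steps, each a list of the processes that acted in that step.
    neighbors: adjacency map of the tree.
    """
    best = {}                                    # longest chain ending at each process
    for step in trace:
        updates = {}
        for p in step:
            prior = [best.get(q, 0) for q in [p, *neighbors[p]]]
            updates[p] = 1 + max(prior)          # extend the longest earlier chain
        best.update(updates)                     # same-step actions are concurrent
    return max(best.values(), default=0)

path = {0: [1], 1: [0, 2], 2: [1]}
print(longest_causality_chain([[0], [1], [2], [1], [0]], path))
# → 5
print(longest_causality_chain([[0, 2], [0, 2]], path))
# → 2
```

Applying updates only after the whole step is processed keeps actions executed in the same step from being chained to each other, which matters for the power-set and synchronous semantics.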


Stabilization Time


Overhead


Longest Causality Chain


Scale

• interleaving semantics

• varied the system size from 100 to 1000 processes

• fixed % of faults (100% is arbitrary state, classic stabilization)


Analysis

• simulation results present a detailed picture of algorithm behavior
• notes
  effort (overhead, actions, time) rises, then diminishes with fault extent. In a legitimate state, a single fault may launch a spurious wave in the opposite direction; stabilization is then proportional to system size. Further faults tend to break up this wave and accelerate stabilization.
  parallel execution semantics (synchronous, power-set) result in greater overhead


Future Research & Conclusion

• the study is not exhaustive: the location of a fault affects the system differently. We believe that a fault closer to the root has a greater ability to perturb the system state

• engagement with practice provides feedback for stabilization research: designers are induced to consider and address the problems of practical import

in our case – space fault spurious “counter-wave” was wholly unexpected – may need algorithmic measures to handle it


Thank You

Questions?
