
Sabotage-Tolerance Mechanisms for Volunteer Computing Systems


Page 1: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Project Bayanihan, MIT LCS and Ateneo de Manila University
Luis F. G. Sarmenta, CCGrid 2001, 5/15/2001

Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Luis F. G. Sarmenta
Ateneo de Manila University, Philippines (formerly MIT LCS)

Page 2: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Volunteer Computing

• Idea: Make it very easy for even non-expert users to join a NOW by themselves
• Minimal setup requirements -> maximum participation
• Very large NOWs very quickly!
  – just invite people
  – SETI@home, distributed.net, others
• The Dream: Electronic Bayanihan
  – achieving the impossible through cooperation

"Bayanihan" mural by Carlos "Botong" Francisco, commissioned by Unilab, Philippines. Used with permission.

Page 3: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

The Problem

• Allowing anyone to join means the possibility of malicious attacks
• Sabotage
  – bad data from malicious volunteers
• Traditional Approach
  – Encryption works against spoofing by outsiders
    • but not against registered volunteers
  – Checksums guard against random faults
    • but not against saboteurs who disassemble code
• Another Approach: Obfuscation
  – prevent saboteurs from disassembling code
  – periodically re-obfuscate to avoid disassembly
  – Promising. But what if we can't do it, or it doesn't work?

Page 4: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Voting and Spot-checking

• Assume the worst case
  – we can't trust workers, so we need to double-check
• Voting
  – Everything must be done at least m times
  – Majority wins (like elections)
  – e.g., Triple Modular Redundancy, NMR, etc.
  – Problem: not so efficient
• Spot-checking
  – Don't check all the time. Only sometimes.
  – But if you're caught, you're "dead"
    • Backtrack – all results of a caught saboteur are invalidated
    • Blacklist – the saboteur's results are ignored
  – Scare people into compliance (like Customs at the airport!)
  – More efficient (?)

Page 5: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Theoretical Analysis: Assumptions

• Master-worker model
  – eager scheduling lets us redo and undo work
  – several batches, each batch of N works
  – P workers, of which a fraction f are saboteurs
  – saboteurs run at the same speed as good workers, so work is roughly evenly distributed
  – no spare workers, so higher redundancy (# of works given out) means worse slowdown (time)
• Non-zero acceptable error rate, err_acc
  – error rate (err) =
    • average fraction of bad final results in a batch
    • probability of error of an individual final result
  – relatively high for naturally fault-tolerant apps
    • e.g., image rendering, genetic algorithms, etc.
  – correspondingly small for most other apps
    • e.g., to guarantee a 1% failure rate for 1000 works, err_acc = 1% / 1000 = 1e-5 (see the worked check below)
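As a quick check of that last example, a minimal sketch: only the 1% and 1000 figures come from the slide; the exact-rate comparison is added here and assumes independent per-result errors.

```python
# Worked check of the acceptable error rate err_acc for 1000 works and a
# 1% allowed batch failure rate (assumes per-result errors are independent).
N = 1000            # final results in the batch
batch_fail = 0.01   # acceptable chance that the batch contains any bad result

err_acc_linear = batch_fail / N                    # the slide's 1% / 1000 = 1e-5
err_acc_exact = 1 - (1 - batch_fail) ** (1.0 / N)  # rate making P(all good) = 99%

print(err_acc_linear)   # 1e-05
print(err_acc_exact)    # ~1.005e-05 -- the linear approximation is essentially exact
```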

Page 6: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Theoretical Analysis: Assumptions

• Assume saboteurs are Bernoulli processes with independent, identical, constant sabotage rate s
  – implies that saboteurs do not agree on when to give bad answers (unless they always give them)
  – simplifying assumption, may not be realistic
  – but may be OK if we assume saboteurs receive works at different times and cannot distinguish them
• Assume saboteurs' bad answers agree
  – allows them to vote (if they happen to give bad answers at the same time)
  – pessimistic assumption
  – we can use crypto and checksums to make it hard to generate agreeing answers
  – implies that there are only 2 kinds of answers: bad and good

Page 7: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Majority Voting

• m-majority voting
  – m out of 2m-1 must match
  – used in hardware, and in systems with spare processors
  – redundancy = 2m-1
• m-first voting
  – accept as soon as m results match
  – same error rate but faster
  – redundancy = m/(1-f)
• Exponential error rate (see the sketch after the figure)
  – err = (cf)^m, where c is between 1.7 and 4
• Good for small f, but bad for large f
• Minimum redundancy and slowdown of 2

[Figure: err vs. m (1 to 6) for f = 0.2, 0.1, 0.01, 0.001, 0.0001; err falls exponentially with m, from 1e0 down to about 1e-22.]
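A minimal sketch of the two quantities above; the bound err = (cf)^m and the redundancy m/(1-f) are from the slide, while the particular values of c and f below are only illustrative.

```python
# Sketch: error bound and redundancy of m-first majority voting.
def mfirst_voting(f, m, c=2.5):
    """f: fraction of saboteurs, m: number of matching results required,
    c: a constant between roughly 1.7 and 4 (per the slide)."""
    err = (c * f) ** m            # error falls exponentially with m...
    redundancy = m / (1.0 - f)    # ...while redundancy (and slowdown) grows linearly
    return err, redundancy

for f in (0.01, 0.1, 0.2):
    for m in (2, 3, 4):
        err, red = mfirst_voting(f, m)
        print(f"f={f:<4} m={m}  err<~{err:.1e}  redundancy={red:.2f}")
# At f=0.01 each extra vote buys ~40x lower error; at f=0.2 only ~2x,
# which is why plain voting is good for small f but bad for large f.
```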

Page 8: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Spot-Checking with Blacklisting

• Lower redundancy
  – 1/(1-q), where q is the spot-check rate
• Good error rates due to backtracking
  – no error as long as the saboteur is caught by the end of the batch
• err = s f (1-qs)^n / ((1-f) + f (1-qs)^n)
  – where n is the number of works received by a worker in the batch (related to, but a bit more than, N/P)
• Saboteur's strategy: only give a few bad answers
  – optimal sabotage rate s* = 1/(q(n+1))
  – maximum error err* < (f/(1-f)) (1/(qne))
• Linear error reduction with n (see the sketch after the figure)
  – larger batches give better error rates

[Figure: simulator results, err vs. sabotage rate s (0 to 1), with err staying below about 0.05; curves for f = 0.7, 0.5, 0.2, 0.1, 0.05, 0.01. Note that it works even if f > 0.5.]
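A minimal sketch of the error formula and bounds above (the expressions for err, s*, and err* are from the slide; the parameter values below are only illustrative).

```python
import math

# Sketch: spot-checking with backtracking and blacklisting.
def spotcheck_err(f, q, n, s):
    """f: saboteur fraction, q: spot-check rate, s: sabotage rate,
    n: works received by a worker in the batch."""
    survive = (1 - q * s) ** n                  # P(saboteur still uncaught after n works)
    return (s * f * survive) / ((1 - f) + f * survive)

f, q, n = 0.2, 0.1, 50                          # illustrative values
s_star = 1.0 / (q * (n + 1))                    # saboteur's best sabotage rate
err_star_bound = (f / (1 - f)) / (q * n * math.e)

print(spotcheck_err(f, q, n, s_star))           # ~0.017
print(err_star_bound)                           # ~0.018 -- the worst case stays under the bound
```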

Page 9: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Spot-Checking without Blacklisting

• What if saboteurs can come back under a new identity?
• Saboteur's strategy: stay for L turns only
• Maximum error (see the sketch after this list)
  – err* < f/(qLe), if L << n
  – err* < f/(qL), as L -> n
  – err* < f/(qL) in all cases
• Linear error reduction with L, not n
  – larger batches don't guarantee better error rates anymore
  – L = 1 gives the worst error, err = f(1-q)
• Try to force larger L's
  – make forging a new ID difficult; impose sign-on delays
  – batch-limited blacklisting
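A minimal sketch of the bound above for saboteurs who rejoin under new identities (the bound err* < f/(qL) and the L = 1 worst case err = f(1-q) are from the slide; the parameter values are illustrative).

```python
# Sketch: spot-checking without blacklisting against come-back saboteurs
# who stay for only L turns before rejoining under a fresh identity.
f, q = 0.2, 0.1   # illustrative saboteur fraction and spot-check rate

print("L=1 worst case err:", f * (1 - q))        # 0.18 -- hit-and-run saboteurs

for L in (10, 50, 200):
    bound = f / (q * L)                          # err* < f/(qL): only linear in L
    print(f"L={L:<4} err bound < {bound:.4f}")
# Unlike the blacklisting case, the batch size n no longer helps;
# only forcing saboteurs to stay longer (larger L) reduces the error.
```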

Page 10: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Voting and Spot-Checking

• Simply running them together works!
• With blacklisting, we can use the spot-checking err rate in place of f
  -> exponentially reduce the linearly-reduced error rate
  – a (qne(1-f))^m improvement
  – big difference! (esp. for large f) – see the sketch after the figures
• Unfortunately, it doesn't work as well without blacklisting
  – bad err rate to begin with
  – substituting err for f doesn't work
• The problem is saboteurs who come back near the end of the batch

[Figures: (left) err vs. stay length l (0 to 250) for m = 2 and m = 3, comparing no spot-checking, l-stay saboteurs, stay-until-caught saboteurs, the theoretical upper bound, and spot-checking with blacklisting; (right) err vs. sabotage rate s (0 to 0.25) for f = 0.5, 0.2, 0.1 with m = 2 and m = 3.]
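A minimal sketch of the "(qne(1-f))^m improvement" above, obtained by substituting the spot-checking error bound from slide 8 into the voting bound from slide 7; the constant c and the parameter values are only illustrative.

```python
import math

# Sketch: voting on top of spot-checking with blacklisting.
def combined_err(f, q, n, m, c=2.5):
    f_eff = (f / (1 - f)) / (q * n * math.e)   # effective "saboteur fraction" after spot-checking
    return (c * f_eff) ** m                    # then apply the m-first voting bound

f, q, n, c = 0.2, 0.1, 50, 2.5
for m in (1, 2, 3):
    voting_only = (c * f) ** m
    both = combined_err(f, q, n, m, c)
    print(f"m={m}  voting only ~{voting_only:.1e}  voting+spot-checking ~{both:.1e}")
# The ratio between the two columns grows as roughly (q*n*e*(1-f))^m ~ 11^m here,
# which is why the combination matters most when f is large.
```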

Page 11: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Credibility-Based Fault-Tolerance

• Problem: errors come from saboteurs who have not yet been spot-checked enough
• Idea: give workers credibility depending on the number of spot-checks passed, k
• General Idea: attach credibility values to objects in the system
• Credibility of X, Cr(X) = probability that X is giving, or will give, a good result

[Figure: example CredWorkPool with threshold θ = 0.999, assuming f ≤ 0.2. Work entries Work0 through Work999 carry work-entry credibilities CrW (e.g., 0.8, 0.492, 0.967, 0.9992, 0.999); workers P1 through P9 carry credibilities CrP based on the number of spot-checks passed, k (e.g., k = 0 -> CrP = 0.8, k = 6 -> 0.967, k = 125 -> 0.998, k = 200 -> 0.999); each result group records its submitting worker (pid), result value (res), group credibility CrG, and result credibilities CrR. Entries whose CrW has reached θ (e.g., Work998 with result Z, Work999 with result J) are marked done; the master hands out nextUnDoneWork otherwise.]

Page 12: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Credibility-Based Fault-Tolerance

• 4 types of credibility (in this implementation)
  – worker, result, result group, work entry
• Credibility Threshold Principle: if we only accept a final result when the conditional probability of its being correct is at least θ, then the overall average error rate will be at most 1-θ (see the sketch after the figure)
• Wait until credibility is high enough

[Figure: the same CredWorkPool example as on the previous slide.]
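A minimal sketch of the Credibility Threshold Principle as the master's acceptance rule; the threshold test and the Work998 numbers come from the slides, but the class and method names below are hypothetical, not the Bayanihan API.

```python
# Sketch: accept a work entry's result only when its credibility reaches theta,
# so the average error rate of accepted results is at most 1 - theta.
THETA = 0.999   # target error rate of at most 0.001 (as in the figure)

class WorkEntry:                     # hypothetical stand-in for a CredWorkPool entry
    def __init__(self, groups):
        self.groups = groups         # result value -> group credibility CrG
        self.done = False
        self.final_result = None

    def credibility(self):           # CrW(W) = CrG of the best result group
        return max(self.groups.values(), default=0.0)

    def try_to_finalize(self, theta=THETA):
        if not self.done and self.credibility() >= theta:
            self.done = True         # otherwise eager scheduling keeps handing it out
            self.final_result = max(self.groups, key=self.groups.get)
        return self.done

w = WorkEntry({"Z": 0.9992, "M": 0.0008})     # like Work998 in the figure
print(w.try_to_finalize(), w.final_result)    # True Z  (0.9992 >= 0.999)
```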

Page 13: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Computing Credibility

• Worker credibility, CrP(P)
  – dubiosity (1-Cr) decreases linearly with the number of spot-checks passed, k
  – CrP(P) = 1 - f, without spot-checking
  – CrP(P) = 1 - f/(ke(1-f)), with spot-checking and blacklisting
  – CrP(P) = 1 - f/k, with spot-checking without blacklisting
• Result credibility, CrR(R)
  – taken from CrP(R.solver)
• Result group credibility, CrG(G)
  – generally increases as the number of matching good-credibility results increases
  – conditional probability given the other groups and the CrR of their results
  – CrG(Ga) = P(Ga good) P(all others bad) / P(getting the groups we got)
  – e.g., if CrR(R) = 1-f for all R, and there are only 2 groups of sizes m1 and m2:
    CrG = (1-f)^m1 f^m2 / ((1-f)^m1 f^m2 + f^m1 (1-f)^m2 + f^m1 f^m2)
• Work entry credibility, CrW(W)
  – CrG(G) of the best group (see the sketch after this list)
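A minimal sketch of the worker and two-group credibility formulas above (the expressions are taken from this slide; the function names and example values are my own).

```python
import math

# Sketch: worker credibility CrP and the two-group result-group credibility CrG.
def worker_cred(f, k, spot_checking=True, blacklisting=True):
    """CrP(P): f is the assumed saboteur fraction, k the spot-checks passed."""
    if not spot_checking or k == 0:
        return 1 - f
    if blacklisting:
        return 1 - f / (k * math.e * (1 - f))
    return 1 - f / k

def two_group_cred(f, m1, m2):
    """CrG of group 1 when every result has CrR = 1 - f and there are exactly
    two result groups, with m1 results in group 1 and m2 in group 2."""
    g1_good = (1 - f) ** m1 * f ** m2      # group 1 good, group 2 bad
    g2_good = f ** m1 * (1 - f) ** m2      # group 2 good, group 1 bad
    both_bad = f ** m1 * f ** m2           # everyone was bad
    return g1_good / (g1_good + g2_good + both_bad)

print(worker_cred(0.2, 6, blacklisting=False))   # ~0.967, like the k=6 workers in the figure
print(two_group_cred(0.2, 2, 1))                 # 2 matching results vs. 1 dissenting result
```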

Page 14: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Results: Credibility with Blacklisting

• N = 10000, P = 200, f = 0.2, 0.1, 0.05, q = 0.1, batch-limited blacklisting
• Note that the error never goes above the threshold
• The trade-off is in slowdown
• The slowdown/err ratio is very good
  – each additional repetition gives > 100x improvement in error rate

[Figures: (left) err vs. sabotage rate s (0 to 1) for thresholds θ = 0.99, 0.999, 0.9999, 0.99999, 0.999999; (right) slowdown (1 to 5) vs. err (1e-7 to 1e0) for f = 0.2, 0.1, 0.05.]

Page 15: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Results: Credibility without Blacklisting

• Error still never goes above the threshold
• A bit slower
• Immune to short-staying saboteurs
  – encourages longer stays

[Figures: (left) err vs. sabotage rate s (0 to 1) for thresholds θ = 0.99, 0.999, 0.9999, 0.99999, 0.999999; (center) slowdown (1 to 5) vs. err for f = 0.2, 0.1, 0.05, 0.01; (right) err vs. stay length l (0 to 250), comparing saboteurs who stay for l turns or until caught with those who stay until caught.]

Page 16: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Results: Using Voting to Spot-check

• Normally, the spot-check rate is kept low because spot-checking implies overhead
• We can use credibility-based voting to spot-check, since credibility-based voting has guaranteed low err
• If redundancy >= 2, then effectively q = 1
• Saboteurs get caught quickly -> low error rates
• Good workers gain high credibility by passing a lot of spot-checks -> reach the threshold faster
• Very good slowdown-to-err slope
  – about 3 orders of magnitude per unit of extra redundancy
  – good for non-fault-tolerant apps

[Figures: (left) err vs. sabotage rate s (0 to 1) for thresholds θ = 0.99, 0.9999, 0.999999; (right) slowdown (1 to 5) vs. err for f = 0.5, 0.2, 0.1.]

Page 17: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Slowdown vs. Err

[Figure: four slowdown (1 to 5) vs. err (1e-7 to 1e0) panels, each plotted for several values of f between 0.01 and 0.5: voting only; credibility with spot-checking and blacklisting; credibility with spot-checking, without blacklisting; credibility using voting for spot-checking, without blacklisting.]

At f = 20%, for the same slowdown, credibility using voting for spot-checking (without blacklisting) gets a 10^5 times better err rate than m-first majority voting!

Page 18: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Variations

• Credibility-based fault-tolerance is highly generalizable
• The Credibility Threshold Principle holds in all cases
  – provided that we compute the conditional probability correctly
• Changes in assumptions and implementation lead to changes in the credibility metrics, e.g.,
  – if we assume saboteurs communicate, then change the result group credibility
  – if we have trustable hosts, or untrustable domains, adjust worker credibility accordingly
  – if we can use checksums, encryption, obfuscation, etc., then adjust CrP, CrG, etc.
  – time-varying credibility
  – compute credibility of batches or work pools

Page 19: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Summary of Mechanisms

• Voting
  – error reduction exponential with redundancy
  – but bad for large f
  – minimum redundancy of 2
• Spot-checking with backtracking and blacklisting
  – error reduction linear with the work done by each volunteer
  – lower redundancy
  – good for large f
• Voting and Spot-checking
  – exponentially reduce the linearly-reduced error rate
• Credibility-based Fault-Tolerance
  – guarantees a limit on error by watching conditional probabilities
  – automatically combines voting and spot-checking as necessary
  – more efficient than simple voting and spot-checking
  – open to variations

Page 20: Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

For more information

• Recently finished Ph.D. thesis
  – Volunteer Computing by Luis F. G. Sarmenta, MIT
• This, and other papers, available from:
  – http://www.cag.lcs.mit.edu/bayanihan/
• Paper at IC 2001 (with PDPTA 2001), Las Vegas, June 25-28
  – more on how we parallelized the simulation
  – details are also in the thesis
• Email:
  – [email protected] or [email protected]