George Saad University of New Mexico Department of Computer Science

Selfishness and Malice in Distributed Systems

Selfishness and Malice
• Selfishness and malice have a negative influence on the performance of distributed systems.
  • Selfishness of players in a game can reduce social welfare.
  • Malicious nodes can seriously disrupt the network.
• In this dissertation, we provide algorithms to address these issues.

Selfishness and Malice
• Selfishness (El-Farol game): we characterize the best correlated equilibrium (BCE) for a game of positive and negative network effects.
  “The Power of Mediation in an Extended El Farol Game”, SAGT’13
• Malice: we develop algorithms to recover networks from Byzantine faults.
  “Self-Healing Communication”, SSS’13
  “Self-Healing Computation”, SSS’14

Part I : Selfishness

El-Farol Game

• A set of n selfish players
• Actions:
  • go to the bar
  • stay home
• The cost function:
  • cost to stay = 1,
  • cost to go: f(x)

Objective: find an equilibrium which minimizes Social Cost, where

Our El Farol Extension

We extend the cost function:
• The cost to stay can be any constant t > 0,
• The cost to go, f(x):

Positive and Negative Network Effects

“Many real situations in fact display both kinds of [positive and negative] externalities … an on-line social media site with limited infrastructure might be most enjoyable if it has a reasonably large audience, but not so large that connecting to the Web site becomes very slow due to the congestion.”

[Easley and Kleinberg, 2010]

Solution Concepts
How to minimize my own cost unilaterally?
• Nash Equilibrium (NE)
  • Unfortunately, NE has high social cost.
• Correlated Equilibrium (CE)
  • Mediator implements CE.

Mediator
• A trusted coordinator that
  • gives recommendations to the players,
  • implements a correlated equilibrium.
• Note that all players have free will.
• A mediator is optimal when it implements the best correlated equilibrium.

Let the mediator have a probability distribution over k ≥ 1 strategy profiles.
• The players know the probability distribution and the strategy profiles.
• The mediator secretly selects one strategy profile according to the probability distribution.
• The mediator advises each player privately and separately.
• No player has an incentive to deviate unilaterally from the advice.
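A minimal Python sketch of the mediator's procedure described above. The function name `mediator_advice`, the four-player profiles, and the probabilities are illustrative assumptions, not taken from the slides.

```python
import random

def mediator_advice(profiles, rng=random):
    """Secretly pick one strategy profile and return each player's private advice.

    profiles: list of (advice_list, probability) pairs known to all players;
    the probabilities are assumed to sum to 1.
    """
    advice_lists = [advice for advice, _ in profiles]
    weights = [p for _, p in profiles]
    chosen = rng.choices(advice_lists, weights=weights, k=1)[0]
    # Player i is told only chosen[i], never the full profile.
    return {player: action for player, action in enumerate(chosen)}

# Hypothetical 4-player instance with k = 2 strategy profiles.
profiles = [
    (["stay", "stay", "stay", "stay"], 1 / 3),
    (["go", "go", "stay", "stay"], 2 / 3),
]
advice = mediator_advice(profiles, random.Random(0))
print(advice)
```

Whether players follow the advice is a separate question: the distribution must be chosen so that no player gains by deviating.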

How to design such a mediator?

( [s11,…,s1n], p1 ), ( [s21,…,s2n], p2 ), …, ( [sk1,…,skn], pk )

( x1, p1 ), ( x2, p2 ), …, ( xk, pk )

Example for (c, s1, s2)-El Farol Game

• For a (2, 4, 4)-El Farol game:
  • Best Nash Equilibrium:
    • ¼-fraction of players go.
    • Social cost = n.
  • An optimal mediator:
    • Strategy profile 1: (x1 = 0, p1 = 1/3)
    • Strategy profile 2: (x2 = ½, p2 = 2/3)
    • Expected social cost = ⅔ n.
  • The optimal social cost (no selfishness):
    • ½-fraction of players go.
    • Social cost = ½ n.
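The numbers in this example can be checked with a short Python sketch. The form of f(x) is an assumption, since the slide text does not preserve it: here f(x) = c − s1·x for x ≤ c/s1 and f(x) = s2·(x − c/s1) afterwards, with cost to stay 1. Also assumed is that x_i is the fraction of players advised to go in profile i. Under these assumptions the three social costs come out as stated.

```python
from fractions import Fraction as F

def go_cost(x, c, s1, s2):
    """ASSUMED cost to go when a fraction x of the players goes (not from the slides)."""
    kink = F(c, s1)
    return c - s1 * x if x <= kink else s2 * (x - kink)

def cost_per_player(x, c, s1, s2, stay_cost=1):
    """Average cost when a fraction x goes and the rest stay home."""
    return x * go_cost(x, c, s1, s2) + (1 - x) * stay_cost

c, s1, s2 = 2, 4, 4
nash = cost_per_player(F(1, 4), c, s1, s2)      # best Nash equilibrium
optimum = cost_per_player(F(1, 2), c, s1, s2)   # no selfishness
mediator = [(F(0), F(1, 3)), (F(1, 2), F(2, 3))]  # (x_i, p_i)
expected = sum(p * cost_per_player(x, c, s1, s2) for x, p in mediator)

# Incentive check for a player advised to stay (large-n approximation:
# a single deviator does not change the fraction x).
p_stay = sum(p * (1 - x) for x, p in mediator)
deviate = sum(p * (1 - x) * go_cost(x, c, s1, s2) for x, p in mediator) / p_stay

print(nash, optimum, expected, deviate)  # costs per player, i.e. divided by n
```

Under the assumed f, a player advised to stay expects cost 1 from going instead, which equals the cost to stay, so following the advice is a best response.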

How efficient is our mediator?

Our Contributions
• For a game of positive and negative network effects, we characterize:
  • the Optimal Social Cost,
  • the Best Nash Equilibrium (BNE), and
  • the Best Correlated Equilibrium (BCE).
• Efficiency of the optimal mediator for this game:
  • When is BCE = BNE?
  • The Mediation Value (MV) and the Enforcement Value (EV) can be unbounded!
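As a hedged illustration, the sketch below computes MV and EV for the (2, 4, 4) example. The slide text does not define the two ratios; the code assumes MV = (BNE social cost) / (BCE social cost) and EV = (BCE social cost) / (optimal social cost). The function names are hypothetical.

```python
from fractions import Fraction

def mediation_value(bne_cost, bce_cost):
    """ASSUMED definition: best Nash cost divided by best mediator cost."""
    return Fraction(bne_cost) / Fraction(bce_cost)

def enforcement_value(bce_cost, optimal_cost):
    """ASSUMED definition: best mediator cost divided by optimal cost."""
    return Fraction(bce_cost) / Fraction(optimal_cost)

# Per-player social costs from the (2, 4, 4)-El Farol example: n, (2/3) n, (1/2) n.
bne, bce, opt = Fraction(1), Fraction(2, 3), Fraction(1, 2)
mv = mediation_value(bne, bce)
ev = enforcement_value(bce, opt)
print(mv, ev)
```

Both ratios are at least 1 under these definitions; values close to 1 mean the mediator adds little (MV) or loses little compared with the optimum (EV).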

Optimal Social Cost

We characterize x* as a function of parameters of our game.

Best Nash Equilibrium

Optimal Mediator

- p is a function of c, s1 and s2.
- p can be 0 or 1 for some values of c, s1 and s2.

When is BCE = BNE?

If c ≤ 1, then
• all players would rather stay, if f(1) ≥ 1;
• all players would rather go, if f(1) < 1.

If c > 1 and λ(c, s1, s2) ≥ 1, then all players would rather go, where:

When is BCE = BNE?

BCE is advantageous over BNE when c > 1 and λ < 1.

Can MV be unbounded?
[Figure: axis labels c, s1, s2, c/s1, 1]

Can EV be unbounded?
[Figure: axis labels c, s1, s2, c/s1, 1]

Related Work
• Linear Congestion Games [CK’05]:
  • 1.577 ≤ EV ≤ 1.6 and MV ≤ 1.015.
• Ranking Games [BFHS’07]:
  • EV = n-1 and MV = n-1 for n>3.
• Virus Inoculation Game [DMNS’09]:
  • EV = and MV = .

Conclusion

• We extended the El-Farol game to have both positive and negative network effects.

• For this extension, we have characterized:
  • the optimal social cost,
  • the BNE, and
  • the BCE.
• We characterized the MV and the EV for this game.
  • We show when BCE = BNE.
  • We show that MV and EV can be unbounded in this game.

Open Problems

• Multi-Site El-Farol Game (> 2 actions):
  • The bar has k > 2 sites.
  • Each player chooses which site to go to.
  • How many strategy profiles are required for the BCE?
• If f(x) is polynomial in x, with degree > 1, then
  • what is the characterization of the BCE?
  • Is the number of strategy profiles related to the degree of f(x)?

Self-Healing Communication Self-Healing Computation

Part II : Malice

Malice
• We consider the presence of an adversary.
• The adversary takes over a subset of nodes to cause faults.
• Byzantine Faults vs Fail-Stop Faults
• Fault Tolerance:
  • Replication
  • Self-healing (automatic recovery)

Fault Tolerance
• Non-self-healing algorithms for the Byzantine model: [NW’03, HK’04, FSY’05, AS’06, AJR’06, AS’07, JY’08, GKKY’10, GKKY’13].
• Self-healing algorithms for the fail-stop model: [BSAS’06, ST’06, HRST’08, HST’09, PT’11, ST’11].
• Self-healing algorithms for Byzantine faults?
• We develop self-healing algorithms to recover from Byzantine faults.

How to recover from Byzantine faults?

Self-Healing Communication
Message is sent through a path of nodes.

Self-Healing Computation
Computation is performed through circuits.

Our Model
• A network of n nodes
• Static and computationally bounded adversary
• Adversary controls up to ¼ of the nodes.
• Partially synchronous communication: there is an upper bound on the number of time steps between sending and receiving a message.
• Rushing adversary: bad nodes wait until they receive all messages from good nodes before responding.
• After the bad nodes are selected, a Quorum Graph is built [KLST’10]:
  • Any quorum is a set of Θ(log n) nodes;
  • Each node is in Θ(log n) quorums; and
  • At most ¼ of the nodes in any quorum are bad.

KLST’10 : Valerie King, Steve Lonargan, Jared Saia and Amitabh Trehan, “Load balanced Scalable Byzantine Agreement through Quorum Building, with Full Information”, ICDCN 2010.

Naïve Communication (no self-healing)

• All-to-all communication between quorums
• Message cost O(l log² n), and latency O(l)
• However, we can do better by self-healing.
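A rough Python sketch of why the naïve approach costs O(l log² n) messages. It assumes l is the number of hops between consecutive quorums and that each receiver keeps the majority of the copies it gets; majority filtering is a standard choice but is not spelled out in the slide text. The function name, the "garbage" payload, and the placement of bad nodes are illustrative.

```python
from collections import Counter

def naive_send(message, path_length, quorum_size, bad_per_quorum):
    """Relay a message over path_length hops with all-to-all forwarding.

    Returns (delivered_value, messages_sent).
    """
    held = [message] * quorum_size  # value held by each node of the current quorum
    messages_sent = 0
    for _ in range(path_length):
        # For illustration, the first bad_per_quorum nodes send garbage.
        sent = ["garbage" if i < bad_per_quorum else value
                for i, value in enumerate(held)]
        # Every sender transmits to every node of the next quorum.
        messages_sent += quorum_size * quorum_size
        majority = Counter(sent).most_common(1)[0][0]
        held = [majority] * quorum_size
    return held[0], messages_sent

delivered, cost = naive_send("hello", path_length=10, quorum_size=16, bad_per_quorum=4)
print(delivered, cost)
```

With quorums of size Θ(log n), each hop costs Θ(log² n) messages, and a ¼ fraction of bad nodes never wins the majority vote.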

Our Contribution
• We developed a self-healing algorithm that detects message corruptions and marks bad nodes.
• Each bad node causes O((log* n)²) corruptions, in expectation.
“Fool me once, shame on you. Fool me ω((log* n)²) times, shame on me.”

Iterated Logarithm
e.g. log* 10¹⁰ = 5
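A small Python sketch of the iterated logarithm: the number of times the logarithm must be applied before the value drops to at most 1. Base 2 is assumed here, which is consistent with the example value for 10¹⁰.

```python
import math

def log_star(n):
    """Iterated logarithm of n, assuming base-2 logarithms."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

print(log_star(10**10))  # 5
print(log_star(65536))   # 4
```

The function grows extremely slowly, so a bound such as O((log* n)²) is nearly constant for any practical network size.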

                Naïve Communication    Our Algorithm
Message cost    O(l log² n)            O(l + log n)
Latency         O(l)                   O(l)
Corruptions     No corruptions         O(t (log* n)²)

Our Algorithm (SEND)

[Diagram of SEND and its components: SEND-PATH, CHECK (CHECK1, CHECK2), HEAL]

HEAL is triggered O(t) times before all bad nodes are marked.

CHECK1
• SEND triggers CHECK1 with probability 1/(log log n)².
• Subquorum size is O(log log n).
• Latency is O(l) and message cost is O(l (log log n)²).
• Detects corruptions with constant probability for l = O(log² n).

CHECK2
• SEND triggers CHECK2 with probability 1/(log* n)².
• CHECK2 has O(log* n) rounds.
• Incremental subquorum size, up to O(log* n).
• Latency is O(l log* n) and message cost is O(l (log* n)²).
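The trigger probabilities are chosen so that the expected checking cost per SEND stays O(l). The Python sketch below multiplies each trigger probability by the corresponding message cost. It treats every O(·) bound as exact with constant 1 and uses base-2 logarithms; both are assumptions made only for illustration.

```python
import math

def log_star(n):
    """Iterated logarithm, base 2 (the base is an assumption)."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

def expected_check_costs(l, n):
    """Expected messages per SEND spent on CHECK1 and CHECK2 (constants dropped)."""
    loglog = math.log2(math.log2(n))
    check1 = (1 / loglog**2) * (l * loglog**2)   # probability x cost
    stars = log_star(n)
    check2 = (1 / stars**2) * (l * stars**2)     # probability x cost
    return check1, check2

c1, c2 = expected_check_costs(l=10, n=2**16)
print(c1, c2)
```

In both cases the squared factor cancels, leaving an expected cost proportional to l.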

CHECK2 Analysis
• Deception Interval: a substring of bad nodes, where a corruption occurs.
• Key Points of Detecting Corruptions:
  • Deception interval shrinks logarithmically with prob. ≥ ½.
  • O(log* n) rounds to shrink deception interval to size zero.

CHECK2 Analysis
• Deception interval shrinks logarithmically from round to round.
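A toy Python model of the shrinking argument. The exact shrink rule is not preserved in the slide text, so the code assumes that one successful round replaces an interval of size s by ⌊log₂ s⌋, and that rounds succeed independently with probability p (at least ½ per the slide). Both the rule and the independence are assumptions.

```python
import math

def successful_shrinks(size):
    """Successful rounds needed to reach size zero under the ASSUMED rule
    size -> floor(log2(size))."""
    rounds = 0
    while size > 0:
        size = int(math.log2(size))
        rounds += 1
    return rounds

def expected_rounds(size, p_shrink=0.5):
    """Expected rounds when each round succeeds independently with p_shrink."""
    return successful_shrinks(size) / p_shrink

print(successful_shrinks(16))     # 16 -> 4 -> 2 -> 1 -> 0
print(successful_shrinks(65536))
print(expected_rounds(16))
```

The number of successful rounds grows like log* of the initial interval size, which is where the O(log* n) round bound comes from in this simplified picture.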

HEAL

• Inspects what each participating node received and sent.
• Marks the nodes that are in conflict.
  * A pair of nodes is in conflict if they accuse each other.
• Each pair of nodes in conflict has at least one bad node.

HEAL
• If ½ of the nodes in any quorum are marked, they are set unmarked.
• HEAL is triggered O(t) times before all bad nodes are marked.
• We show this using a potential function argument:
  • f(b,g) is monotonically increasing,
  • Δf(b,g) is at least some positive constant.
  • When f(b,g) = t, we are done.
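The marking rules of HEAL can be sketched in Python as follows. The data layout (a set of (accuser, accused) pairs and a list of quorums) and the function name `heal_marks` are hypothetical, and "½ of the nodes" is read as "at least half", which is an assumption.

```python
def heal_marks(accusations, quorums, marked=None):
    """Return the set of marked nodes after one HEAL pass.

    accusations: set of (accuser, accused) pairs.
    quorums: list of lists of node names.
    """
    marked = set(marked or ())
    # A pair is in conflict when each node accuses the other; mark both.
    for accuser, accused in accusations:
        if (accused, accuser) in accusations:
            marked.update((accuser, accused))
    # If at least half of a quorum is marked, unmark that quorum's nodes.
    for members in quorums:
        marked_here = marked & set(members)
        if 2 * len(marked_here) >= len(members):
            marked -= marked_here
    return marked

accusations = {("a", "b"), ("b", "a"), ("c", "d")}
print(heal_marks(accusations, [["a", "b", "c", "d", "e", "f"]]))  # a and b stay marked
print(heal_marks(accusations, [["a", "b", "c"]]))                 # quorum reset
```

Marking both nodes of a conflicting pair is safe in the sense stated above: at least one of them is bad, so every conflict marks at least one bad node.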

Empirical Results
• Our simulation runs:
  • over butterfly networks of quorums,
  • for different network sizes, up to n = 30k, and
  • for different fractions of bad nodes.

• Simulation terminates after all bad nodes are marked.

• The results are taken over 3000 experiments.

Empirical Results
# Messages reduces by a factor of 60 (n~30k)
• CHECK1: # messages is improved by a factor of 60 (from 39,100 to 649).
• CHECK2: # messages is improved by a factor of 33 (from 39,100 to 1,177).

Empirical Results
Latency increases by 1½ times (n~30k)
• CHECK1: latency increases by 1½ times (from 13 to 18).
• CHECK2: latency increases by 2 times (from 13 to 25).

Empirical Results
Corruption Probability 0
[Plots for CHECK1 and CHECK2]

Empirical Results
# Messages reduces by O(log² n) times

Empirical Results
Latency increases by Θ(1) times

How to recover from Byzantine faults?

Self-Healing Communication
Message is sent through a path of nodes.

Self-Healing Computation
Computation is performed through circuits.

Quorum Graph
• Quorum Graph has:
  • n input quorums;
  • m quorum gates; and
  • one output quorum

Naïve Computation
• No self-healing
• All nodes in each quorum (gate) perform the same computation
• Results are sent between quorums via all-to-all communication
• Expensive resource cost

Our Contribution
We develop a self-healing algorithm for computation networks.

                    Naïve Computation      Our Algorithm
Message cost        O((n+m) log² n)        O(m + n log n)
Computation cost    O((n+m) log² n)        O(m + n log n)
Latency             O(l)                   O(l)
Corruptions         No corruptions         O(t (log* n)²)

Our Algorithm (COMPUTE)

[Diagram of COMPUTE and its components: CHECK, EVALUATE, RECOVER]

CHECK Algorithm
• CHECK has O(log* n) rounds.
• In each round, nodes are selected uniformly at random, and the same computation is performed.
[Figure: Round 1, Round 2]

CHECK Algorithm
• Adversary corrupts computation in a Deception Subgraph.
• Key points of corruption detection:
  • We prove that the deception subgraph shrinks logarithmically in each round with constant probability.
  • Once the deception subgraph shrinks to size zero, the corruption is detected.

Shrinks Logarithmically
[Figure: the deception subgraph over Rounds 1 to 4]

RECOVER

• Inspects what each participating node received and sent.
• Marks the nodes that are in conflict.
  * A pair of nodes is in conflict if they accuse each other.
• Each pair of nodes in conflict has at least one bad node.

RECOVER
• If ½ of the nodes in any quorum are marked, they are set unmarked.
• RECOVER is triggered O(t) times before all bad nodes are marked.
• We show this using a potential function argument:
  • f(b,g) is monotonically increasing, and
  • when it reaches t, we are done.

Empirical Results
• Our simulation runs:
  • over perfect binary trees of quorums,
  • for different network sizes, up to 8k, and
  • for different fractions of bad nodes.

• Simulation terminates after all leaders are good.

• The results are taken over 3000 experiments.

Empirical Results
# Messages reduces by a factor of 65 (n~8k)
• From 66M to 1.01M messages.

Empirical Results
Latency increases by 1.75 times (n~8k)
• From 36 to 63 time steps.

Empirical Results
Corruption Probability 0

Empirical Results
# Messages reduces by O(log² n) times!

Empirical Results
Latency increases by Θ(1) times

Conclusion

• We developed self-healing algorithms to recover networks from Byzantine faults.

• Message cost is reduced polylogarithmically in n, compared to non-self-healing algorithms.

• Experiments show that message cost is reduced by
  • up to a factor of 60 for communication networks,
  • up to a factor of 65 for computation networks.

• For t < n/4, the expected total number of corruptions is O(t (log* n)²).
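The reported reduction factors can be recomputed from the message counts shown on the results slides. The sketch below assumes the larger count in each pair is the naïve algorithm and the smaller one is the self-healing algorithm; that pairing is inferred from the stated factors, not labelled in the slide text.

```python
# (naive messages, self-healing messages); pairing is an assumption.
reported = {
    "communication, CHECK1": (39_100, 649),
    "communication, CHECK2": (39_100, 1_177),
    "computation": (66_000_000, 1_010_000),
}

factors = {name: naive / ours for name, (naive, ours) in reported.items()}
for name, factor in factors.items():
    print(f"{name}: about {factor:.1f}x fewer messages")
```

The rounded ratios match the factors of 60, 33 and 65 quoted in the talk.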

Open Problems
• Can we limit the number of corruptions to O(t)?
• How to self-heal networks with churn? With an adaptive adversary?
• How to self-heal asynchronous networks?
• We trigger CHECK and select the nodes in a centralized manner. How can we make CHECK decentralized?
  • We propose a decentralized CHECK for future work.
  • We implement a simulation that suggests interesting results.

Thanks! Any Questions?
