Service d’échantillonage uniforme résiliant aux

HAL Id: hal-00985631https://hal.archives-ouvertes.fr/hal-00985631

Submitted on 30 Apr 2014

HAL is a multi-disciplinary open accessarchive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come fromteaching and research institutions in France orabroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, estdestinée au dépôt et à la diffusion de documentsscientifiques de niveau recherche, publiés ou non,émanant des établissements d’enseignement et derecherche français ou étrangers, des laboratoirespublics ou privés.

Service d’échantillonage uniforme résiliant auxcomportments malveillants

Emmanuelle Anceaume, yann Busnel, Bruno Sericola

To cite this version:Emmanuelle Anceaume, yann Busnel, Bruno Sericola. Service d’échantillonage uniforme résiliant auxcomportments malveillants. ALGOTEL 2014 – 16èmes Rencontres Francophones sur les AspectsAlgorithmiques des Télécommunications, Jun 2014, France. pp.1–4. hal-00985631

https://hal.archives-ouvertes.fr/hal-00985631

https://hal.archives-ouvertes.fr

Service d’échantillonage uniforme résiliant

aux comportments malveillants

Emmanuelle Anceaume1, Yann Busnel2, Bruno Sericola3

1IRISA & CNRS, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France2LINA & Université de Nantes, 2 rue de la Houssinière, BP 92208, 44322 Nantes Cedex 03, France3INRIA Rennes – Bretagne Atlantique, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France

Nous proposons une solution au problème d’échantillonage uniforme dans les systèmes à grande échelle en présence de

comportements byzantins. Notre premier algorithme permet d’uniformiser à la volée un flux de données (items) de taille

non bornée, sous l’hypothèse que les probabilités exactes d’occurrence des items sont connues. Nous modélisons le

comportement de notre algorithme par une chaîne de Markov dont nous étudions le régime stationnaire et le transitoire.

Notre second algorithme relache l’hypothèse de connaissance de la probabilité d’occurrence des items dans le flux

initial. Ces probabilités sont estimées à la volée en utilisant un espace mémoire logarithmique en la taille du flux. Nous

évaluons la résilience de cet algorithme face à des attaques ciblées et par innondation. Nous quantifions l’effort que doit

fournir l’adversaire (i.e., nombre d’items à injecter dans le flux initial) pour violer la propriété d’uniformité.

Keywords: Echantillonnage uniforme, flux de données, adversaire byzantin, algorithme d’approximation probabiliste.

1 Introduction

The uniform node sampling service offers a single simple primitive to applications using it, which returns

the identifier of a random node that belongs to the system. Providing at any time randomly chosen nodes in

the system has deserved a lot of attention to construct large scale distributed applications. Node sampling

is a cooperative service in the sense that all the nodes of the system contribute to this service by conti-

nuously sending and forwarding information about their presence. Unfortunately, the unavoidable presence

of malicious nodes in large scale and open systems seriously impedes the construction of uniform node

sampling. The objective of malicious nodes mainly consists in continuously and largely biasing the input

data stream out of which samples are obtained, to prevent (correct) nodes from being selected as samples.

Consequences of these collective attacks (also called Sybil attacks) are, among others, the overwhelming

load of some specific nodes when it is used to provide random locations for data caching or storage, or the

eventual partitioning of the system when the node sampling service is used to build nodes local views in

epidemic-based protocols. Solutions that basically consist in storing the identifier of all the nodes of the sys-

tem so that each of these node identifiers can be randomly selected when needed are impracticable and even

infeasible due to the size and the dynamicity of such networks. Rather providing a solution that requires as

little space as possible (e.g., sublinear in the population size of the system) is definitely desirable. Bortnikov

et al. [BGK+09] have recently proposed a uniform node sampling algorithm that tolerates malicious nodes

by exploiting the properties offered by min-wise permutations. The sampling component outputs the node

identifier whose image value under the randomly chosen permutation is the smallest value ever encounte-

red. Thus eventually, by the property of min-wise permutation, the sampler converges towards a random

sample. However by the very same properties of min-wise permutation functions, once the convergence has

been reached, it is stuck to this convergence value independently from any subsequent input values. Thus

the sample does not evolve according to the current composition of the system, which makes it static.

In this paper, we address this problem by first proposing an omniscient algorithm capable of tolerating

any bias introduced by the adversary in the input stream. By omniscient we mean that the algorithm knows

Emmanuelle Anceaume, Yann Busnel, Bruno Sericola

the number of occurrences of each received element in the full input stream. We analyze the stationary

and transient behaviour of this algorithm through a Markov chain analysis. We then propose a randomized

approximation algorithm capable of outputting an unbiased and non static sample of the population whate-

ver the strategy of the adversary is. This sample may deviate from an exact uniform sample, however the

deviation is bounded with any tunable probability. This algorithm is a one-pass algorithm and only com-

pact synopses or sketches that contain the most important information about data items are locally stored.

This algorithm does not require any a priori knowledge neither on the size of the input stream, nor on the

number of distinct elements that compose it, nor on the frequency distribution of these elements. We then

evaluate the minimum effort that needs to be exerted by a strong adversary to bias the output stream when

two representative attacks are launched, i.e., the targeted attacks in which the adversary focuses on biasing

the frequency of a single node identifier, and the flooding attack which aims at biasing all the node identi-

fiers frequencies. One of the main results of this analysis is the fact that the effort that needs to be exerted

by the adversary to subvert the sampling service can be made arbitrarily large by any correct node by just

increasing the memory space of the sampler. Finally, extensive simulations (both on real data and synthetic

traces) confirm the robustness of our sampler service. To the best of our knowledge, no previous work has

proposed such an analysis.

2 System model

We consider a large scale and dynamic open system N in which each node i ∈ N receives a very large

stream σi made of node identifiers (also denoted ids). We denote n =| N |. Node identifiers arrive quickly

and sequentially. Each node identifier j of σi is drawn from a set Ω = 1, . . . ,2r, where r is chosen to be

large enough to make the probability of identifier collision negligible. The number of times a node identifier

j recurs in the stream is called the frequency of j. For memory constraints, nodes can locally store only a

small amount of information with respect to the number of ids in the system. Thus the stream needs to be

processed in an online manner, i.e., any item of the stream that has not been locally stored for any further

processing cannot be read any more. In addition the amount of computation per data element of the stream

must be low to keep pace with the stream.

We assume the presence of malicious nodes that collectively try to subvert the system. We model these

adversarial behaviors through an adversary that fully controls and manipulates these malicious nodes. We

suppose that the adversary is strong in the sense that it may actively tamper with the data stream of any node

i by observing, and inserting any number of malicious nodes identifiers. Indeed, the goal of the adversary is

to judiciously increase the frequency of f chosen node identifiers to bias the sample built by non malicious

nodes. The number f is chosen by the adversary and depends on the sampling protocol parameters. Note

that each malicious node identifier does not need to correspond to a single real node. Indeed, the adversary

will augment its power by generating numerous node identifiers, such that only a limited number of real

malicious nodes are linked to these identifiers. However, affecting multiple identifiers to a single node is

costly as one needs to interact with a central authority to receive a certificate assessing the validity and

integrity of the identifier. A node present in the system that is not malicious is said to be correct. Note

that correct nodes cannot a priori distinguish correct node identifiers from malicious ones. Classically, we

assume that the adversary can neither drop a message exchanged between two correct nodes nor tamper with

its content without being detected. This is achieved by assuming the existence of a signature scheme (and

the corresponding public-key infrastructure) ensuring the authenticity and integrity of messages. This refers

to the authenticated Byzantine failure model. We finally suppose that any algorithm run by any correct node

to build a uniform node sampling service is public knowledge to avoid some kind of security by obscurity.

However the adversary has not access to the local random coins used in the algorithms.

3 Node sampling service tolerant to malicious nodes

3.1 The addressed problem

A node sampling service tolerant to malicious nodes is a functionality local to each correct node i of

the system. Although malicious nodes have also access to a sampling service, we cannot impose any as-

Service d’échantillonage uniforme résiliant aux comportments malveillants

sumptions on how they use it as their behavior can be totally arbitrary. This service continuously reads the

input stream σi received by node i. Data streams are made of the node identifiers exchanged within the

system. Note that the analysis presented in this paper is independent from the way data streams are built.

That is, they may result from the continuous propagation of node ids through gossip-based algorithms, or

from the node ids received during random walks initiated at each node of the system. In addition, the input

stream of any correct node can be arbitrarily biased by an adversary, which is achieved by infinitely often

augmenting it with the f ids it manipulates. The objective of the sampling service strategy is to process on

the fly the input stream and to output a stream guaranteeing both Uniformity and Freshness. Specifically, if

Si(t) denotes the output of the sampling service at any correct node i at any discrete time t, then a sampling

service tolerant to malicious behaviors should meet the following two properties.

Property 3.1 (Uniformity) For any t ≥ 0, for any node id j ∈ N , PSi(t) = j=1

n.

Property 3.2 (Freshness) For any t ≥ 0, for any node id j ∈ N , t ′ > t | Si(t′) = j 6= /0 with probability 1.

Uniformity states that any node in the system should have the same probability to appear in the sample

of correct nodes in the overlay, while Freshness says that any node that recurs infinitely often in the stream,

should have a non-null probability to appear infinitely often in the sample of any correct nodes in the system.

3.2 An omniscient and a knowledge-free one-pass algorithms

This section first presents an omniscient one-pass algorithm that guarantees both the Uniformity and

Freshness properties. By omniscient, we mean that the algorithm knows exactly the occurrence probability

p j of j in the full stream σi (Hypothesis H1). Note however that the algorithm does not know ahead of time

the identifiers that will appear in σi. This knowledge is built on the fly when reading σi. The omniscient

strategy has uniquely access to a data structure Γi, referred to as the sampling memory, whose cardinality of

Γi is constant and is denoted by c with c ≪ n. The sampling memory contains the node ids that are selected

by the strategy when reading σi. Specifically, the omniscient algorithm reads on the fly and sequentially

the input stream and, for each read element j, decides whether j is a good candidate for being stored into

the constant size memory Γi or not. If p j is very small, then j must definitively be stored into Γi so that j

might have a chance to be part of the output stream. On the other hand, with larger p j, there will be other

opportunities for the sampler to receive j in the future. Inserting j into Γi with a well chosen probability

a j is a necessary condition to prevent very frequent ids from continuously eclipsing the ids already stored

in Γi. Although, this is not sufficient to guarantee that a rare id k already stored in Γi will not be evicted

each time a new id j is stored (assuming that Γi is full upon receipt of j). Recall that the goal of the

adversary is to prevent identifiers of correct nodes to uniformly appear in the output stream. A sufficient

condition is achieved by removing k from Γi with probability rk/∑ℓ∈Γirℓ, where r1, . . . ,rn are positive real

numbers. Finally, a random node id k′ is chosen from Γi and written in the output stream (note that k′ is not

removed from Γi). We show in the companion paper [ABS13] that setting a j = q/p j with q=minℓ∈N pℓ and

r j =1c

guarantees that the algorithm converges to a stationary regime where both Uniformity and Freshness

properties hold, and that the time to converge decreases w.r.t. the size c of the sampling memory.

We now show how to extend this algorithm to get rid of hypothesis H1. Clearly such an assumption is

unrealistic since the adversary may modify on the fly the occurrence probability of any node identifier in

the stream by increasing the occurrence frequency of the f node identifiers it manipulates. This extension,

called the knowledge-free algorithm makes no assumption with respect to the input stream σi. For each

received j from σi, it selects the id that will be part of the output stream by solely relying on an estimation

of p j. Both estimations are computed on the fly by using very few space and a small number of operations.

Specifically, the knowledge-free strategy uses one additional data structure with respect to the omniscient

one. This data structure is the Count-Min (CM) Sketch [CM05]. The CM sketch is built on the fly and

provides at any time, and for each j read from σi, an approximation of the number of times j has appeared

in σi from the inception of the stream. The error of the estimator in answering a query for the frequency of

j is within a factor of ε with probability δ. Sketch uses a two-dimensional array F of k× s counters with

k = ⌈e/ε⌉ and s = ⌈log2(1/δ)⌉, and a collection of 2-universal hash functions h1, . . . ,hs. Each time an

item j is read from the input stream, this causes one counter per line to be incremented, i.e., F [v][hv( j)] is

Emmanuelle Anceaume, Yann Busnel, Bruno Sericola

incremented for all v ∈ 1, . . . ,s. When a query is issued to get an estimate of the frequency of j (i.e., the

number of occurrences of j read so far from the stream), the returned value corresponds to the minimum

among the s values of F [v][hv( j)] (v ∈ 1, . . . ,s). The space required by the CM sketch is proportional to1ε

log21δ, and the update time per element is significantly sublinear in the size of the sketch [CM05].

3.3 Performance evaluation

The omniscient algorithm cannot be tampered with any adversary [ABS13]. We have then evaluated the

minimum effort that needs to be exerted by a strong adversary to bias the frequency estimator [CM05],

when two representative attacks are launched, i.e., the targeted attacks in which the adversary focuses on

biasing the frequency of a single node identifier, and the flooding attack which aims at biasing all the

node identifiers frequencies. Both evaluations are conducted by modeling them as an urn problem. One

of the main results of this analysis is the fact that the effort that needs to be exerted by the adversary

to subvert the sampling service can be made arbitrarily large by any correct node by just increasing the

memory space of the sampler. We have implemented both the omniscient and knowledge-free algorithms

and have conducted a series of experiments on different types of streams and for different parameters set-

tings [ABS13]. We have fed our algorithms with both real-world data sets and synthetic traces that are

representative of over-represented (malicious) node identifiers. Due to space constraints, we present some

results that summarize the quality of our algorithms. Figure 1 illustrates the behaviour of both algorithms.

Omniscient strategy 0

200

400

600

800

10000 10,000 20,000 30,000 40,000

0 50 100 150 200 250 300 350 400

Knowledge-free strategy 0

200

400

600

800

1000

0 50 100 150 200 250 300 350 400

Input stream 0

200

400

600

800

1000

0 50 100 150 200 250 300 350 400

FIGURE 1: Frequency distribution as a function of time.

Settings : m = 40,000, n = 1000, c = 15, k = 15, s = 14.

It presents a kind of isopleth in which the hori-

zontal axis shows time, the vertical axis represents

the node identifiers, and the body of the graph de-

picts the frequency of each node identifier. A ligh-

ter color is representative of a very frequent node

identifier. The figure at the top of Figure 1 repre-

sents the frequency of each node identifier in the

input stream of the node sampler. This figure shows

that at the inception of the stream, a few number

of node identifiers have been received in the input

stream which explains the dark color on the left. As

time elapses, the number of received identifiers in-

creases (up to 40,000), and progressively the bias

of the input stream appears : a small number of

identifiers recur with a high frequency equal to 400,

while the frequency of the other node identifiers is

significantly lower. Now the two other figures re-

present the output of the node sampler run with

respectively the knowledge-free strategy and with

the omniscient one. Clearly the omniscient strategy

succeeds in outputting a uniform stream, illustrated

by a color that progressively and uniformly becomes lighter as the number of received identifiers augments.

The knowledge-free strategy is not as performant as the omniscient one, nevertheless it succeeds in signifi-

cantly decreasing the peak of high frequency identifiers with a very small memory w.r.t. the length m of the

input stream (the Count-Min data structure F is a 15×14 array) .

Références[ABS13] E. Anceaume, Y. Busnel, and B. Sericola. Uniform node sampling service robust against collusions of

malicious nodes. In the 43rd Intl Conf. on Dependable Systems and Networks (DSN 2013), 2013.

[BGK+09] E. Bortnikov, M. Gurevich, I. Keidar, G. Kliot, and A. Shraer. Brahms : Byzantine Resilient Random

Membership Sampling. Computer Networks, 53 :2340–2359, 2009.

[CM05] G. Cormode and S. Muthukrishnan. An improved data stream summary : the count-min sketch and its

applications. Journal of Algorithms, 55(1) :58–75, 2005.

Documents

Service d’échantillonage uniforme résiliant aux