
Privacy-preserving Distributed Movement Data Aggregation

Anna Monreale, Wendy Hui Wang, Francesca Pratesi, Salvatore Rinzivillo, Dino Pedreschi, Gennady Andrienko, Natalia Andrienko

Abstract We tackle the problem of obtaining general information about vehicle traffic in a city from movement data collected by individual vehicles. An important issue here is the possible violation of the privacy of the vehicle users. Movement data are sensitive because they may describe typical movement behaviors and therefore be used for re-identification of individuals in a database. We provide a privacy-preserving framework for movement data aggregation based on trajectory generalization in a distributed environment. The proposed solution, based on the differential privacy model, provides a formal data protection safeguard. Using real-life data, we demonstrate the effectiveness of our approach, also in terms of the data utility preserved by the data transformation.

1 Introduction

The widespread availability of low-cost GPS devices enables collecting data about the movements of people and objects at a large scale. Understanding human mobility behavior in a city is important for improving the use of city space and the accessibility of various places and utilities, managing the traffic network, and reducing traffic jams. Generalization and aggregation of individual movement data can provide an overall description of traffic flows in a given time interval and their variation over time. Paper [2] proposes a method for generalization and aggregation of movement data that requires having all individual data in a central station. This centralized setting entails two important problems: a) the amount of information to be collected and processed may exceed the capacity of the storage and computational resources, and b) the raw data describe the mobility behavior of the individuals in great detail, which could enable the inference of very sensitive information related to the personal private sphere.

Anna Monreale, Francesca Pratesi and Dino Pedreschi
University of Pisa, Pisa, Italy, e-mail: [email protected], [email protected], [email protected]

Hui (Wendy) Wang
Stevens Institute of Technology, NJ, USA, e-mail: [email protected]

Salvatore Rinzivillo
ISTI-CNR, Pisa, Italy, e-mail: [email protected]

Gennady Andrienko and Natalia Andrienko
Fraunhofer IAIS, Sankt Augustin, Germany, e-mail: [email protected], [email protected]

In order to solve these problems, we propose a privacy-preserving distributed computation framework for the aggregation of movement data. We assume that on-board location devices in vehicles continuously trace the positions of the vehicles and can periodically send derived information about their movements to a central station, which stores it. The vehicles provide a statistical sample of the whole population, so that the information can be used to compute a summary of the traffic conditions on the whole territory. For protecting individual privacy, we propose a data transformation method based on the well-known differential privacy model. To reduce the amount of information that each vehicle communicates to the central station, we propose the application of sketching algorithms to the differentially private data, which allow obtaining a compressed representation of them. The central station, which we call the coordinator, is able to reconstruct the movement data represented by the sketches; although transformed to guarantee privacy, these data preserve some important properties of the original data that make them useful for mobility analysis.

The remainder of the paper is organized as follows. Section 2 introduces background information and definitions. Section 3 describes the system architecture and states the problem. Section 4 presents our privacy-preserving framework. In Section 5, we discuss the privacy analysis. Experimental results from applying our method to real-world data are presented and discussed in Section 6. Section 8 concludes the paper.

2 Preliminaries

2.1 Movement Data Representation

Definition 1 (Trajectory). A trajectory, or spatio-temporal sequence, is a sequence of pairs T = ⟨l1, t1⟩, . . . , ⟨ln, tn⟩, where ti (i = 1 . . . n) denotes a timestamp such that ∀ 1 ≤ i < n: ti < ti+1, and li = (xi, yi) are points in R².

Intuitively, each pair 〈li, ti〉 indicates that the object is in the position li = 〈xi,yi〉 at time ti.

We assume that the territory is subdivided into cells C = {c1, c2, . . . , ct} which compose a partition of the territory. During a travel, a user goes from one cell to another. We use g to denote the function that applies the spatial generalization to a trajectory. Given a trajectory T, this function generates the generalized trajectory g(T), i.e., a sequence of moves with temporal annotations, where a move is a pair (lci, lcj) indicating that the moving object moves from the cell ci to the adjacent cell cj. Note that lci denotes the pair of spatial coordinates representing the centroid of the cell ci; in other words, lci = ⟨xci, yci⟩. A temporally annotated move is a quadruple (lci, lcj, tci, tcj), where lci is the location of the origin, lcj is the location of the destination, and tci, tcj are the times associated with lci and lcj. As a consequence, we define a generalized trajectory as follows.

Definition 2 (Generalized Trajectory). Let T = ⟨l1, t1⟩, . . . , ⟨ln, tn⟩ be a trajectory. Let C = {c1, c2, . . . , ct} be the set of the cells that compose the territory partition. A generalized version of T is a sequence of temporally annotated moves

Tg = (lc1, lc2, tc1, tc2)(lc2, lc3, tc2, tc3) . . . (lc(m−1), lcm, tc(m−1), tcm)

with m ≤ n.

Now we show how a generalized trajectory can be represented by a frequency distribution vector. First, we define the Move Frequency function MF that, given a generalized trajectory Tg, a move (lci, lcj), and a time interval, computes how many times the move appears in Tg subject to the temporal constraint. More formally, we define the Move Frequency function as follows.

Definition 3 (Move Frequency). Let Tg be a generalized trajectory and let (lci, lcj) be a move. Let τ be a temporal interval. The Move Frequency function is defined as:

MF(Tg, (lci, lcj), τ) = |{(lci, lcj, ti, tj) ∈ Tg | ti ∈ τ ∧ tj ∈ τ}|.

This function can easily be extended to a set of generalized trajectories TG. In this case, the value computed by the function represents the total number of movements from the cell ci to the cell cj in a time interval over the whole set of trajectories.

Definition 4 (Global Move Frequency). Let TG be a set of generalized trajectories and let (ci, cj) be a move. Let τ be a time interval. The Global Move Frequency function is defined as:

GMF(TG, (ci, cj), τ) = ∑_{Tg ∈ TG} MF(Tg, (ci, cj), τ).

The number of movements between two cells computed by either MF or GMF describes the amount of traffic flow between the two cells in a specific time interval. This information can be represented by a frequency distribution vector.

Definition 5 (Vector of Moves). Let C = {c1, c2, . . . , ct} be the set of the cells that compose the territory partition. The vector of moves M, with s = |{(ci, cj) | ci is adjacent to cj}| positions, is the vector containing all possible moves. The element M[i] = (lci, lcj) is the move from the cell ci to the adjacent cell cj.

Definition 6 (Frequency Vector). Let C = {c1, c2, . . . , ct} be the set of the cells that compose the territory partition and let M be the vector of moves. Given a set of generalized trajectories TG in a time interval τ, the corresponding frequency vector is the vector f of size s = |{(ci, cj) | ci is adjacent to cj}| where each f[i] = GMF(TG, M[i], τ).

The frequency vector of a single trajectory is defined analogously; it only requires computing the function MF instead of GMF.
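To make the definitions concrete, here is a minimal Python sketch (the cell names, the time representation, and all function names are our illustrative choices, not the paper's implementation):

```python
# A generalized trajectory is a list of temporally annotated moves:
# (origin_cell, dest_cell, t_origin, t_dest), with times as plain numbers.

def move_frequency(tg, move, interval):
    """MF: count occurrences of `move` in trajectory `tg` within `interval`."""
    lo, hi = interval
    return sum(1 for (o, d, ti, tj) in tg
               if (o, d) == move and lo <= ti <= hi and lo <= tj <= hi)

def global_move_frequency(trajectories, move, interval):
    """GMF: sum MF over a set of generalized trajectories."""
    return sum(move_frequency(tg, move, interval) for tg in trajectories)

def frequency_vector(trajectories, moves, interval):
    """Frequency vector over the vector of moves M."""
    return [global_move_frequency(trajectories, m, interval) for m in moves]

tg1 = [("c1", "c2", 1, 2), ("c2", "c3", 2, 4)]
tg2 = [("c1", "c2", 3, 5)]
moves = [("c1", "c2"), ("c2", "c3")]
print(frequency_vector([tg1, tg2], moves, (0, 10)))  # [2, 1]
```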

2.2 Differential Privacy

Differential privacy implies that adding or deleting a single record does not significantly affect the result of any analysis. Intuitively, differential privacy can be understood as follows. Let a database D include a private data record di about an individual i. By querying the database, it is possible to obtain certain information I(D) about all data and information I(D − di) about the data without the record di. The difference between I(D) and I(D − di) may enable inferring some private information about the individual i. Hence, it must be guaranteed that I(D) and I(D − di) do not significantly differ for any individual i.

The formal definition [13] is the following; the parameter ε specifies the level of privacy guaranteed.

Definition 7 (ε-differential privacy). A privacy mechanism A gives ε-differential privacy if for any datasets D1 and D2 differing on at most one record, and for any possible output D′ of A, we have

Pr[A(D1) = D′] ≤ e^ε × Pr[A(D2) = D′]

where the probability is taken over the randomness of A.

Two principal techniques for achieving differential privacy have appeared in the literature, one for real-valued outputs [13] and the other for outputs of arbitrary types [20]. A fundamental concept of both techniques is the global sensitivity of a function mapping underlying datasets to (vectors of) reals.

Definition 8 (Global Sensitivity). For any function f : D → R^d, the sensitivity of f is

∆f = max_{D1,D2} ||f(D1) − f(D2)||_1

for all D1, D2 differing in at most one record.

For analyses whose outputs are real, a standard mechanism to achieve differential privacy is to add Laplace noise to the true output of a function. Dwork et al. [13] propose the Laplace mechanism, which takes as inputs a dataset D, a function f, and the privacy parameter ε. The magnitude of the added noise conforms to a Laplace distribution with probability density function p(x|λ) = (1/(2λ)) e^(−|x|/λ), where λ is determined by both the global sensitivity of f and the desired privacy level ε.

Theorem 1. [13] For any function f : D → R^d over an arbitrary domain D, the mechanism A with A(D) = f(D) + Laplace(∆f/ε) gives ε-differential privacy.
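A minimal sketch of the Laplace mechanism (the inverse-transform sampler and the example counting query are our own illustrative choices, not part of the paper):

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) by inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Theorem 1: release f(D) + Lap(sensitivity / epsilon)."""
    return true_value + laplace_noise(sensitivity / epsilon)

# Example: a counting query (sensitivity 1) released with epsilon = 0.5.
noisy = laplace_mechanism(42, sensitivity=1.0, epsilon=0.5)
```

A smaller ε means a larger noise scale and stronger privacy; the released value is unbiased around the true count.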

A relaxed version of differential privacy, discussed in [3], tolerates a small additional privacy loss δ, expressed as a variation in the output distribution of the privacy mechanism A. It is defined as follows.

Definition 9 ((ε, δ)-differential privacy). A privacy mechanism A gives (ε, δ)-differential privacy if for any datasets D1 and D2 differing on at most one record, and for any possible output D′ of A, we have

Pr[A(D1) = D′] ≤ e^ε × Pr[A(D2) = D′] + δ

where the probability is taken over the randomness of A.

Note that, if δ = 0, (ε, 0)-differential privacy is exactly ε-differential privacy. In the remainder of this paper we will refer to this relaxed version of differential privacy.


3 Problem Definition

3.1 System Architecture

We consider a system architecture like the one described in [8]. In particular, we assume a distributed-computing environment comprising a collection of k (trusted) remote sites (nodes) and a designated coordinator site. Streams of data updates arrive continuously at the remote sites, while the coordinator site is responsible for generating approximate answers to periodic user queries posed over the unions of the remotely observed streams across all sites. Each remote site exchanges messages only with the coordinator, providing it with state information on its (locally observed) streams. There is no communication between remote sites.

In our scenario, the coordinator is responsible for computing the aggregation of movement data on a territory by combining the information received from each node. To obtain the aggregation of the movement data in the centralized setting, we need to generalize all the trajectories using the cells of a partition of the territory. In our distributed setting we assume that the partition of the territory, i.e., the set of cells C = {c1, . . . , cm} used for the generalization, is known by all the nodes and by the coordinator. Each node, representing a vehicle that moves in this territory, observes in a given time interval a sequence of spatio-temporal points (a trajectory), generalizes it, and contributes to the computation of the global vector.

Formally, each remote site j ∈ {1, . . . , k} observes local update streams that incrementally render a collection of (up to) s distinct frequency distribution vectors (equivalently, multi-sets) f1,j, . . . , fs,j over data elements from corresponding integer domains [Ui] = {0, . . . , Ui − 1}, for i = 1, . . . , s; that is, fi,j[v] denotes the frequency of element v ∈ [Ui] observed locally at remote site j.

For each i ∈ {1, . . . , s}, the coordinator computes the global frequency distribution vector fi = ∑_{j=1}^{k} fi,j.

3.2 Privacy Model

In our setting we assume that each node in the system is secure; in other words, we do not consider attacks at the node level. Instead, we take into consideration possible attacks from any intruder between the node and the coordinator (i.e., attacks during the communications), and any intruder at the coordinator site, so our privacy-preserving technique has to guarantee privacy even against a malicious behavior of the coordinator. We consider as sensitive any information from which the typical mobility behavior of a user may be inferred. This information is sensitive for two main reasons: 1) typical movements can be used to identify the drivers of specific vehicles even when a simple de-identification of the individuals in the system is applied; and 2) the places visited by a driver could reveal specific sensitive areas such as clinics, hospitals, or the user's home.

Therefore, we need effective privacy mechanisms on the real count associated with each move, in order to generate uncertainty. As a consequence, the goal of our framework is to compute a distributed aggregation of movement data for a comprehensive exploration of them while preserving privacy.


Definition 10 (Problem Definition). Given a set of cells C = {c1, . . . , cm} and a set V = {V1, . . . , Vk} of vehicles, the privacy-preserving distributed movement data aggregation problem (DMAP) consists in computing, in a specific time interval τ, the vector f^τ_DMAP(V) = [f1, f2, . . . , fs], where each fi = GMF(TG, M[i], τ) and s = |{(ci, cj) | ci is adjacent to cj}|, while preserving privacy. Here, TG is the set of generalized trajectories produced by the k vehicles V in the time interval τ, and M is the vector of moves defined on the set of cells C.

4 Our Solution

Solving this problem means defining the computation carried out by each node and by the coordinator. Clearly, in order to guarantee privacy within this framework we may apply many privacy-preserving techniques, depending on the privacy attack model and on the background knowledge of the adversary considered in this scenario. In this paper, we provide a solution based on differential privacy, a very strong privacy model that is independent of the background knowledge of an adversary. In this section, we describe the details of our solution, including the computation of each node and of the coordinator in the system.

The pseudo code of our algorithm is shown in Algorithm 1. Each node represents a vehicle that moves in a specific territory; in a given time interval, the vehicle observes sequences of spatio-temporal points (trajectories) and computes the corresponding frequency vector to be sent to the coordinator. The computation of the frequency vector requires the four steps described in Algorithm 1: (a) trajectory generalization; (b) frequency vector construction; (c) frequency vector transformation for privacy guarantees; and (d) vector sketching for compressing the information to be communicated.

4.1 Trajectory Generalization

Given a specific division of the territory, a trajectory is generalized in the following way. We apply a place-based division of the trajectory into segments. The area c1 containing its first point l1 is found. Then, the second and following points of the trajectory are checked for being inside c1 until finding a point li not contained in c1. For this point li, the containing area c2 is found.

The trajectory segment from the first point to the i-th point is represented by the pair (c1, c2). Then, the procedure is repeated: the points starting from li+1 are checked for containment in c2 until finding a point lk outside c2, the area c3 containing lk is found, and so forth up to the last point of the trajectory.

As a result, the trajectory is represented by the sequence of moves (c1, c2, t1, t2)(c2, c3, t2, t3) . . . (cm−1, cm, tm−1, tm). Here, in a specific quadruple, ti is the time moment of the last position in ci and tj is the time moment of the last position in cj. There may also be a case when all points of a trajectory are contained in one and the same area c1; then the whole trajectory is represented by the sequence {c1}. Since we globally want to compute an aggregation of moves, we discard this kind of trajectory. Moreover, as most methods for the analysis of trajectories are suited to work with positions specified as points, the areas {c1, c2, . . . , cm} are replaced, for practical purposes, by the sequence lc1, lc2, . . . , lcm consisting of their centroids.
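The place-based segmentation above can be sketched as follows (the grid-based `cell_of` lookup, the point format, and all names are illustrative assumptions; the paper's partition would be a Voronoi tessellation):

```python
import math

def generalize(trajectory, cell_of):
    """Represent a trajectory as a sequence of temporally annotated moves.

    `trajectory` is a list of (x, y, t) points; `cell_of` maps a point to the
    id of the partition cell containing it (grid or Voronoi lookup).
    """
    # 1) Compress the point sequence into (cell, last_time_in_cell) segments.
    segments = []
    for (x, y, t) in trajectory:
        c = cell_of((x, y, t))
        if segments and segments[-1][0] == c:
            segments[-1] = (c, t)  # update the last time seen in this cell
        else:
            segments.append((c, t))
    # 2) Consecutive segments become moves (ci, cj, ti, tj), where ti/tj are
    #    the last times in ci/cj; a single-cell trajectory yields no moves.
    return [(a, b, ta, tb) for (a, ta), (b, tb) in zip(segments, segments[1:])]

# Toy example: unit grid cells, cell id = (floor(x), floor(y)).
cell = lambda p: (math.floor(p[0]), math.floor(p[1]))
traj = [(0.2, 0.2, 0), (0.8, 0.4, 1), (1.3, 0.5, 2), (2.1, 0.6, 3)]
print(generalize(traj, cell))
# [((0, 0), (1, 0), 1, 2), ((1, 0), (2, 0), 2, 3)]
```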


Algorithm 1: NodeComputation(ε, τ, M, TG, w, d)
Input: A privacy budget ε, a time interval τ, the vector of moves M, a set of trajectories TG, and the dimensions of the sketch, w and d.
Output: The sketch vector representing the privacy-preserving frequency vector sk(f_Vj).

1:  foreach observed trajectory T do
2:      Tg = TrajectoryGeneralization(M, T)      // Trajectory Generalization (Sec. 4.1)
3:      foreach move (lci, lcj) ∈ Tg do          // Update of the Frequency Vector f_Vj (Sec. 4.2)
4:          n = MF(Tg, (lci, lcj), τ)
5:          f_Vj[(lci, lcj)] += n
6:  foreach vector element f_Vj[i] do            // Privacy Transformation (Sec. 4.3)
7:      noise = Laplace(0, 1/ε)
8:      if noise > f_Vj[i] then
9:          noise = f_Vj[i]
10:     if noise < −f_Vj[i] then
11:         noise = −f_Vj[i]
12:     f_Vj[i] = f_Vj[i] + noise
13: sk(f_Vj) = CountMin(f_Vj, w, d)              // Generation of Sketch Vector (Sec. 4.4)
14: return sk(f_Vj)

4.2 Frequency Vector Construction

After the generalization of a trajectory, the node computes the Move Frequency function for each move (lci, lcj) in that trajectory and updates its frequency vector f_Vj associated with the current time interval τ. Intuitively, the vehicle populates the frequency vector f_Vj according to the generalized trajectory observed. So, at the end of the time interval τ, if M[i] = (m, n), the element f_Vj[i] contains the number of times that the vehicle Vj moved from m to n in that time interval.

4.3 Vector Transformation for Achieving Privacy

As we stated in Section 3.2, if a node sends the frequency vector without any data transformation, any intruder may infer the typical movements of the vehicle represented by the node. For example, the intruder could learn the vehicle's most frequent move; this information can be considered very sensitive because the cells of this move usually correspond to the user's home and work place. Clearly, the generalization step can help protect the user's privacy, but its effect depends on the density of the area: if the area is sparse, a cell may contain only a few places, and in that case privacy is at risk. How can we hide the event that the user moved from a location a to a location b in the time interval τ? We propose a solution based on a very strong privacy model called ε-differential privacy (Section 2.2). As explained above, the key point of this model is


the definition of the sensitivity. Given a move (a, b), its sensitivity is straightforward: releasing its frequency has sensitivity 1, as adding or removing a single flow can affect the frequency by at most 1. Thus, adding noise drawn from Lap(1/ε) to the frequency of each move in the frequency vector satisfies ε-differential privacy. As a consequence, at the end of the time interval τ, before sending the vector to the coordinator, the node has to generate, for each position of the vector (i.e., for each move), noise from the Laplace distribution with zero mean and scale 1/ε, and add it to the value in that position. At the end of this step the node transforms f_Vj into f̃_Vj.

Differential privacy must be applied with caution because in some contexts the added noise, although with small probability, can reach values of arbitrary magnitude and destroy the data utility. Moreover, noise drawn from the Laplace distribution could produce negative flow values for a move, and negative flows do not make sense. To prevent these two problems we draw the noise from a truncated version of the Laplace distribution. In particular, for each value x of the vector f_Vj we draw the noise from Lap(1/ε) and bound it to the interval [−x, x]. In other words, if the original flow is f_Vj[i] = x, the perturbed version is a flow value in the interval [0, 2x]. The use of a truncated Laplace distribution can lead to privacy leaks; in Section 5 we show that our privacy mechanism satisfies (ε, δ)-differential privacy, where δ represents this privacy loss.
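The truncated-noise step can be sketched as follows (the inverse-transform Laplace sampler and all names are our illustrative choices, not the authors' code):

```python
import math
import random

def truncated_laplace_perturb(freq_vector, epsilon):
    """Add Laplace(0, 1/epsilon) noise to each flow, with the noise
    truncated to [-x, x], so a true flow x becomes a value in [0, 2x]."""
    out = []
    for x in freq_vector:
        u = random.random() - 0.5
        noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
        noise = max(-x, min(x, noise))  # bound the noise to [-x, x]
        out.append(x + noise)
    return out

fv = [10, 0, 3]
noisy = truncated_laplace_perturb(fv, epsilon=1.0)
# each noisy[i] lies in [0, 2 * fv[i]]; a zero flow stays exactly zero
```

Note that truncation is exactly what introduces the δ term analyzed in Section 5.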

4.4 Vector Sketching for Compact Communications

In a system like ours, an important issue is the amount of data to be communicated. Real-life systems usually involve thousands of vehicles (nodes) located anywhere in the territory. Each vehicle has to send to the coordinator the information contained in its frequency vector, whose size depends on the number of cells in the partition of the territory. The number of cells in a territory can be very large, which can make each frequency vector too big. As an example, in the dataset of real-life trajectories used in our experiments we have about 4,200 vehicles and a frequency vector with about 15,900 positions. Therefore, the system has to be able to handle both a very large number of nodes and a huge amount of global information to be communicated. These considerations make it necessary to reduce the information communicated. We propose the application of a sketching method [9] that allows a good compression of the information to be communicated; in particular, we apply the Count-Min sketch algorithm [7]. In general, this algorithm maps the frequency vector onto a more compressed vector. The sketch consists of an array C of d × w counters and, for each of the d rows, a pairwise-independent hash function h_j that maps items onto [w]. Each item is mapped onto d entries in the array, and its value is added to the counters there. Given the sketch representation of a vector, we can estimate the original value of each component as f[i] = min_{1≤j≤d} C[j, h_j(i)]. The estimate of each component is affected by an error, but it has been shown that the overestimate is less than n/w, where n is the number of components. So, setting d = log(1/γ) and w = O(1/α) ensures that the estimate of f[i] has error at most αn with probability at least 1 − γ.
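A compact sketch of a Count-Min structure along these lines (the row-salted SHA-1 hash is a simple illustrative stand-in for a proper pairwise-independent hash family, and the class layout is our own):

```python
import hashlib

class CountMinSketch:
    def __init__(self, w, d):
        self.w, self.d = w, d
        self.counts = [[0] * w for _ in range(d)]

    def _h(self, j, i):
        # Illustrative hash: row-salted SHA-1, reduced mod w.
        digest = hashlib.sha1(f"{j}:{i}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def update(self, i, value):
        # Add `value` to one counter in each of the d rows.
        for j in range(self.d):
            self.counts[j][self._h(j, i)] += value

    def estimate(self, i):
        # Min over rows: never underestimates the true count.
        return min(self.counts[j][self._h(j, i)] for j in range(self.d))

def sketch_vector(freq_vector, w, d):
    sk = CountMinSketch(w, d)
    for i, v in enumerate(freq_vector):
        if v:
            sk.update(i, v)
    return sk

fv = [5, 0, 2, 7]
sk = sketch_vector(fv, w=64, d=4)
# sk.estimate(i) >= fv[i] for every i; each row sums to sum(fv)
```

Each row is a linear projection of the vector, which is why every row's total equals the total flow, and why the per-row minimum is an overestimate.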


4.5 Coordinator Computation

The computation of the coordinator is composed of two main phases: 1) computation of the setof moves and 2) computation of the aggregation of global movements.

Move Vector Computation. In an initial setup phase, the coordinator has to send to the nodes the vector of moves (Definition 5). The computation of this vector depends on the set of cells that represent the partition of the territory. This partition can be a simple grid or a more sophisticated territory subdivision such as a Voronoi tessellation. Sharing the vector of moves is a requirement of the whole process, because each node has to use the same data structure to allow the coordinator to correctly compute the global flows.

Global Flow Computation. The coordinator has to compute the global vector, corresponding to the global aggregation of movement data in a given time interval τ, by composing all the local frequency vectors. It receives the sketch vector sk(f_Vj) from each node; then it reconstructs each frequency vector from the sketch vector, using the estimation described in Section 4.4. Finally, the coordinator computes the global frequency vector by summing the estimated vectors component by component. Clearly, the estimated global vector is an approximate version of the global vector that would be obtained by summing the local frequency vectors after the privacy transformation alone.
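Assuming each node's sketch has already been decoded into an estimated frequency vector as described in Section 4.4, the coordinator's final step is a component-wise sum; a minimal illustrative sketch (names are ours):

```python
def global_frequency_vector(estimated_vectors):
    """Sum the nodes' estimated frequency vectors component by component."""
    s = len(estimated_vectors[0])
    assert all(len(v) == s for v in estimated_vectors), "vectors must share the move vector M"
    return [sum(v[i] for v in estimated_vectors) for i in range(s)]

# Three nodes, four moves.
nodes = [[2, 0, 1, 0], [0, 3, 1, 0], [1, 1, 0, 4]]
print(global_frequency_vector(nodes))  # [3, 4, 2, 4]
```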

5 Privacy Analysis

As pointed out in [17], differential privacy must be applied with caution. The privacy protection provided by differential privacy relates to the data-generating mechanism and deterministic aggregate-level background knowledge. Since in our problem the trajectories in the raw database are independent of each other, and no deterministic statistics of the raw database will ever be released, we are ready to show that Algorithm 1 satisfies (ε, δ)-differential privacy.

Let F and F′ be the frequency distribution before and after adding Laplace noise. We observe that bounding the Laplace noise will lead to some privacy leakage on some values. For instance, from noisy frequency values that are large, an attacker can infer that these values were not obtained from small ones. To analyze the privacy leakage of our bounded-noise approach, we first recall the concept of statistical distance, defined in [3]. Formally, given two distributions X and Y, the statistical distance between X and Y over a set U is defined as

d(X, Y) = max_{S⊆U} (Pr[X ∈ S] − Pr[Y ∈ S]).

[3] also shows the relationship between (ε, δ)-differential privacy and the statistical distance.

Lemma 1. [3] Given two probabilistic functions F and G with the same input domain, where F is (ε, δ1)-differentially private. If for all possible inputs x the statistical distance between the output distributions of F and G satisfies

d(F(x), G(x)) ≤ δ2,


then G is (ε, δ1 + (e^ε + 1)δ2)-differentially private.

Let F and F′ be the frequency distribution before and after adding Laplace noise. We can show that the statistical distance between F and F′ can be bounded as follows:

Lemma 2. [3] Given an (ε, δ)-differentially private function F with F(x) = f(x) + R for a deterministic function f and a random variable R. Then for all x, the statistical distance between F and its throughput-respecting variant F′ with the bound b on R is at most

d(F(x), F′(x)) ≤ Pr[|R| > b].

[3] has the following lemma to bound the probability Pr[|R|> b].

Lemma 3. [3] Given a function F with F(x) = f(x) + Lap(∆f/ε) for a deterministic function f, the probability that the Laplacian noise Lap(∆f/ε) applied to f is larger than b is bounded by:

Pr(Lap(∆f/ε) > b) ≤ 2(∆f)² / (b²ε²).
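One way to verify this tail bound (our reading; the derivation in [3] may differ) is via Chebyshev's inequality, using the fact that Lap(λ) has variance 2λ², with λ = ∆f/ε:

```latex
\Pr\!\left(\left|\mathrm{Lap}\!\left(\tfrac{\Delta f}{\varepsilon}\right)\right| > b\right)
  \;\le\; \frac{\operatorname{Var}\!\left[\mathrm{Lap}(\Delta f/\varepsilon)\right]}{b^2}
  \;=\; \frac{2(\Delta f/\varepsilon)^2}{b^2}
  \;=\; \frac{2(\Delta f)^2}{b^2\varepsilon^2}.
```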

We stress that in our approach the bound b on each frequency value x is not fixed; indeed, b = x. Therefore, each frequency value x incurs a different amount of privacy leakage, and our approach thus achieves a different degree of (ε, δ)-differential privacy guarantee on each frequency value x. Theorem 2 gives the details.

Theorem 2. Given the total privacy budget ε, for each frequency value x, Algorithm 1 ensures (ε, (e^ε + 1) · 2/(x²ε²))-differential privacy.

Proof. Algorithm 1 consists of four steps: TrajectoryGeneralization, FrequencyVectorUpdate, PrivacyTransformation, and SketchVectorGeneration. The TrajectoryGeneralization and FrequencyVectorUpdate steps merely prepare the frequency vectors for the privacy transformation, so we focus on the privacy guarantees of the PrivacyTransformation and SketchVectorGeneration steps. For each frequency value x, the PrivacyTransformation step achieves (ε, (e^ε + 1) · 2(∆f)²/(x²ε²))-differential privacy; this follows directly from Lemma 1 and Lemma 3, noting that the frequency vector with (untruncated) Laplace noise satisfies (ε, 0)-differential privacy. In our approach ∆f = 1, so the PrivacyTransformation step achieves (ε, (e^ε + 1) · 2/(x²ε²))-differential privacy. The SketchVectorGeneration step accesses only a differentially private frequency vector, never the underlying database; as proven by Hay et al. [16], post-processing of differentially private results remains differentially private. Therefore Algorithm 1 as a whole maintains (ε, (e^ε + 1) · 2/(x²ε²))-differential privacy.
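The bounded-noise transformation analyzed above, together with its per-value δ, can be sketched as follows. The function names and the exact truncation rule (clamping the noise to [−x, x], our reading of "b = x") are assumptions, not code from the paper:

```python
import math
import random

def sample_laplace(scale, rng=random):
    # inverse-CDF sampling of a zero-mean Laplace variate (stdlib only)
    u = rng.random() - 0.5
    if u == -0.5:          # avoid log(0); occurs with negligible probability
        u = 0.0
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def privacy_transform(freq_vector, eps, delta_f=1.0, rng=random):
    """Add Laplace(Δf/ε) noise to each frequency x, truncating the noise
    at b = x, so every noisy frequency stays within [0, 2x]."""
    noisy = []
    for x in freq_vector:
        r = sample_laplace(delta_f / eps, rng)
        r = max(-x, min(x, r))       # bound the noise: |r| <= b = x
        noisy.append(x + r)
    return noisy

def per_value_delta(x, eps):
    # the δ of Theorem 2 for frequency value x: (e^ε + 1) · 2 / (x² ε²)
    return (math.exp(eps) + 1.0) * 2.0 / (x ** 2 * eps ** 2)
```

Note how the leakage shrinks quadratically with x: large frequencies, whose truncation is rarely triggered, obtain a much smaller δ than small ones.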

6 Experiments

6.1 Dataset

For our experiments we used a large dataset of GPS vehicle traces collected over the period from 1 May to 31 May 2011.


6.2 Space Tessellation

The generalization and aggregation of movement data is based on space partitioning. Arbitrary territory divisions, such as administrative districts or regular grids, do not reflect the spatial distribution of the data, and the resulting aggregations may not convey the essential spatial and quantitative properties of the traffic flows over the territory. Our method for territory partitioning extends the data-driven method suggested in paper [2]. Using a given sample of points (which may, for example, be randomly selected from a historical set of movement data), the original method finds spatial clusters of points that can be enclosed by circles with a user-chosen radius. The centroids of the clusters are then taken as generating seeds for a Voronoi tessellation of the territory. We have modified the method so that dense point clusters can be subdivided into smaller clusters; the sizes of the resulting Voronoi polygons thus vary with the point density: large polygons in data-sparse areas and small polygons in data-dense areas. The method requires the user to set three parameters: maximal radius R, minimal radius r, and minimal number of points N allowing a cluster to be subdivided. In our experiments, we used a tessellation with 2681 polygons obtained with R = 10 km, r = 500 m, and N = 80.
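The clustering step behind the seed generation can be sketched as a greedy circle covering. This is a simplified, illustrative stand-in for the method of [2], not the paper's code; the Voronoi step itself, which needs a computational-geometry library, is omitted:

```python
import math

def greedy_cluster(points, max_radius):
    """Greedily group 2-D points into clusters enclosable by circles of
    the given radius; the centroids are usable as Voronoi seeds.
    A simplified stand-in for the method of [2]."""
    centroids, members = [], []
    for p in points:
        for i, c in enumerate(centroids):
            if math.dist(p, c) <= max_radius:
                members[i].append(p)
                # keep the centroid at the running mean of its members
                xs, ys = zip(*members[i])
                centroids[i] = (sum(xs) / len(xs), sum(ys) / len(ys))
                break
        else:
            # no existing circle fits: start a new cluster at this point
            centroids.append(p)
            members.append([p])
    return centroids, members
```

In the subdivision variant described above, a cluster with at least N points would then be re-clustered with the smaller radius r, producing smaller Voronoi polygons in data-dense areas.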

6.3 Utility Evaluation

When a node elaborates its Movement Vector to be sent to the coordinator, it applies a series of transformations: the privacy transformation, with the objective of protecting sensitive information, and the sketch transformation, to reduce the volume of communication. In this section we analytically measure the error introduced by these transformations in order to assess the utility of the resulting transformed dataset.

Figure 1 allows a visual comparison of the resulting flow visualizations. The original flow (A) is drawn with arrows whose thickness is proportional to the volume of trajectories observed globally on a link. The flows resulting from the transformed personal Movement Vectors are presented in the adjacent pictures. We tested the transformation with ε = 0.9, 0.8, . . . , 0.1, but we show three maps corresponding to the values 0.9, 0.5, and 0.3.

The second transformation, applied after the privacy protection, adds a further perturbation to the data. However, the resulting aggregation computed by the coordinator is still comparable with the original data. For example, in Figure 2 we considered a privacy transformation with ε = 0.3 and three different sketch perturbations, respectively with parameters tocomplete.

The two comparisons proposed above give the intuition that, while the transformations protect sensitive individual information, the utility of the data is preserved. In the following we also explore this assumption analytically. Since the two transformations operate on the flows of the movement vector, we compare two measures: (1) the flow per link (fpl), i.e., the directed volume of traffic between two adjacent zones; and (2) the flow per zone (fpz), i.e., the sum of the incoming and outgoing flows in a zone. Figure 3 shows the resulting distributions for privacy transformations with ε = 0.9, 0.8, . . . , 0.1. Figure 3 (left) shows how the flows per link are preserved by all the privacy transformations: for a fixed flow value (x) we count the number of links (y) having that flow. The distribution of this measure is preserved for all the transformations. Figure 3 (right) shows the aggregate flows passing through each zone.
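The two measures can be computed directly from a table of directed link flows; a small sketch (the dictionary layout is our illustration, not the paper's data structure):

```python
from collections import defaultdict

def flow_per_zone(link_flows):
    """fpz: the sum of incoming and outgoing flows of each zone.
    link_flows maps (source_zone, destination_zone) -> volume;
    this mapping is itself the flow per link (fpl)."""
    fpz = defaultdict(int)
    for (src, dst), volume in link_flows.items():
        fpz[src] += volume   # outgoing flow of the source zone
        fpz[dst] += volume   # incoming flow of the destination zone
    return dict(fpz)

links = {("A", "B"): 10, ("B", "A"): 4, ("B", "C"): 6}
print(flow_per_zone(links))  # {'A': 14, 'B': 20, 'C': 6}
```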


Fig. 1 Resulting movement data aggregation for the traffic observed during a single day (May 2011) without and with different privacy transformations: (A) without privacy transformation; (B) DP with ε = 0.9; (C) DP with ε = 0.5; (D) DP with ε = 0.3

It is evident from the figure that the transformations with different parameters yield similar distributions. However, they all tend to increase the number of zones with low flow (i.e., flows less than 80) and to reduce the number of zones with high flow. Since the number of zones is fixed by the Movement Vector, this means that the flow in some zones is increased while in others it is decreased. To maintain the data utility for mobility density analysis, we want to preserve the relative density distribution over the zones, i.e., it is desirable that zones with formerly low (high) traffic still present low (high) traffic after the transformation. To check this property, we show in Figure 4 the correlation plots comparing the original flows with the transformed ones. The charts show that the transformed flows are underestimated when the original flow is above 80, as already noticed in Figure 3.

In general, the flow distribution is also preserved spatially: zones with dense traffic remain dense after the transformation, as can be seen in Figure 5. The figure shows four density maps drawn with circles, where the size of each circle is proportional to the difference from the local median of each dataset. The maps allow us to recognize the dense areas (red circles, above the median) separated by sparse areas (blue circles, below the median). The high-density traffic zones follow the highways and the major city centers along their routes.


Fig. 2 Resulting movement data aggregation for the traffic observed during a single day (?? May 2011) with a privacy transformation with ε = 0.3 and three sketch transformations with different parameters: (A) without privacy transformation; (B) DP with ε = 0.9; (C) DP with ε = 0.5; (D) DP with ε = 0.3

Fig. 3 Distribution of flow per link (left) and flow per zone (right).

7 Related Work

The existing methods for privacy-preserving publishing of trajectory data can be categorized into two classes: (1) generalization/suppression-based data perturbation, and (2) differential privacy. We discuss the related work of these two categories below.

Generalization/suppression-based data perturbation techniques. There have been some recent works on privacy-preserving publishing of spatio-temporal moving points using generalization/suppression techniques. The most widely used privacy model in these works


Fig. 4 Correlation of transformed flows with DP and ε = 0.9 (left) and ε = 0.3 (right).

is adapted from so-called k-anonymity [25, 24], which requires that an individual should not be identifiable within a group of size smaller than k based on their quasi-identifiers (QIDs), i.e., a set of attributes that can be used to uniquely identify individuals. Abul et al. [1] propose the (k,δ)-anonymity model that exploits the inherent uncertainty of a moving object's whereabouts, where δ represents possible location imprecision. Terrovitis and Mamoulis [26] assume that different adversaries own different, disjoint parts of the trajectories. Their anonymization technique is based on awareness of the adversarial knowledge and is obtained only by means of suppression of the dangerous observations from each trajectory. Yarovoy et al. [29] consider timestamps as the quasi-identifiers and define a novel attack that can associate moving objects with (anonymized) trajectories by using attack graphs; they define k-anonymity to defend against such attacks. Monreale et al. [22] propose a spatial generalization approach to achieve k-anonymity. A general problem of these k-anonymity-based privacy-preserving techniques is that they assume a certain level of background knowledge of the attackers, which may not be available to the data owner in practice.

Differential privacy. The recently proposed concept of differential privacy (DP) [13] addresses the above issue by providing rigorous privacy guarantees against adversaries with arbitrary background information. Differential privacy was first formally presented in [13]. There are two popular mechanisms for achieving differential privacy: the Laplace mechanism, which supports queries whose outputs are numerical [13], and the exponential mechanism, which works for queries whose output spaces are discrete [20]. The basic idea of the Laplace mechanism is to add noise to aggregate queries (e.g., counts) or queries that can be reduced to simple aggregates.
The Laplace mechanism has been widely adopted in many existing works for various data applications. For instance, [27, 10] present methods for minimizing the worst-case error of count queries; [4, 12] consider the publication of data cubes; [16, 28] focus on publishing histograms; and [21, 18] propose methods for releasing data in a differentially private way for data mining. On the other hand, for analyses whose outputs are not real-valued or make no sense after adding noise, the exponential mechanism selects an output from the output domain, r ∈ R, by taking into consideration its score under a given utility function q, in a differentially private


Fig. 5 Traffic flow per zone drawn with circles proportional to the difference from the median for each transformation, with privacy transformations with different parameters: (A) without privacy transformation; (B) DP with ε = 0.9; (C) DP with ε = 0.5; (D) DP with ε = 0.3

manner. It has been applied to the publication of auction results [20], coresets [14], frequent patterns [5], and decision trees [15].
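The exponential mechanism just described can be sketched as follows (the function and parameter names, and the max-score shift for numerical stability, are ours):

```python
import math
import random

def exponential_mechanism(outputs, quality, eps, sensitivity=1.0, rng=random):
    """Sample r from `outputs` with probability proportional to
    exp(eps * q(r) / (2 * Δq)), as in McSherry and Talwar [20]."""
    scores = [quality(r) for r in outputs]
    top = max(scores)
    # subtracting the max score rescales all weights without changing
    # the sampling distribution, avoiding overflow in exp()
    weights = [math.exp(eps * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    t = rng.random() * total
    acc = 0.0
    for r, w in zip(outputs, weights):
        acc += w
        if t <= acc:
            return r
    return outputs[-1]
```

With a large ε the mechanism almost always returns the highest-quality output; with a small ε it approaches uniform sampling, which is how the privacy/utility trade-off shows up here.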

Regarding the publication of differentially private spatial data, Chen et al. [6] propose to release a prefix tree of trajectories with injected Laplace noise. Each node in the prefix tree contains a doublet of the form ⟨tr(v), c(v)⟩, where tr(v) is the set of trajectories with prefix v, and c(v) is a version of |tr(v)| with Laplace noise. Compared with our work, the prefix tree in [6] is data-dependent, i.e., it must have a different structure when the underlying database changes; in our work, the frequency vector is data-independent. Cormode et al. present a solution that publishes a differentially private spatial index (e.g., quadtrees and kd-trees) to provide a private description of the data distribution [10]. Its main utility concern is the accuracy of multi-dimensional range queries (e.g., how many individuals fall within a given region). Therefore, the spatial index only stores the counts of a specific spatial decomposition; it does not store movement information (e.g., how many individuals move from location i to location j) as


in our work. In another paper, Cormode et al. [11] propose to publish a contingency table of trajectory data. The contingency table can be indexed by specific locations so that each cell contains the number of people who commute from the given source to the given destination. The contingency table is very similar to our frequency vector structure. However, [11] has a different focus from ours: we investigate how to publish the frequency vector in a differentially private way, while [11] addresses the sparsity issue of the contingency table and presents a method for releasing a compact summary of the table with Laplace noise.

There is also some work on publishing time-series data with differential privacy guarantees [19, 23]. Since we only consider spatial data, these works are complementary to ours.

8 Conclusion

In this paper, we have studied the problem of computing movement data aggregations based on trajectory generalization in a distributed system while preserving privacy. We have proposed a method based on the well-known notion of differential privacy, which provides formal data protection guarantees. In particular, in our framework each vehicle, before sending the information about its movements within a time interval, applies a privacy transformation to the data and then creates a summary of the private data (using a sketching algorithm) to reduce the amount of information to be communicated. The results obtained in our experiments show that the proposed method preserves important properties of the original data, allowing an analyst to use them for meaningful mobility data analyses.

Future investigations could explore other methods for achieving differential privacy; for example, it would be interesting to understand the impact of using the geometric mechanism instead of the Laplace mechanism.

References

1. Osman Abul, Francesco Bonchi, and Mirco Nanni. Never walk alone: Uncertainty for anonymity in moving objects databases. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering (ICDE), pages 376–385, 2008.

2. Natalia Andrienko and Gennady Andrienko. Spatial generalization and aggregation of massive movement data. IEEE Transactions on Visualization and Computer Graphics, 17:205–219, 2011.

3. Michael Backes and Sebastian Meiser. Differentially private smart metering with battery recharging. IACR Cryptology ePrint Archive, 2012:183, 2012.

4. Boaz Barak, Kamalika Chaudhuri, Cynthia Dwork, Satyen Kale, Frank McSherry, and Kunal Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 273–282, 2007.

5. Raghav Bhaskar, Srivatsan Laxman, Adam Smith, and Abhradeep Thakurta. Discovering frequent patterns in sensitive data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 503–512, 2010.

6. Rui Chen, Benjamin C. M. Fung, Bipin C. Desai, and Neriah M. Sossou. Differentially private transit data publication: a case study on the Montreal transportation system. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 213–221, 2012.

7. Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

8. Graham Cormode and Minos N. Garofalakis. Approximate continuous querying over distributed streams. ACM Transactions on Database Systems, 33(2), 2008.

9. Graham Cormode, Minos N. Garofalakis, Peter J. Haas, and Chris Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1–294, 2012.

10. Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Entong Shen, and Ting Yu. Differentially private spatial decompositions. In ICDE, pages 20–31, 2012.

11. Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Thanh T. L. Tran. Differentially private summaries for sparse data. In ICDT, pages 299–311, 2012.

12. Bolin Ding, Marianne Winslett, Jiawei Han, and Zhenhui Li. Differentially private data cubes: optimizing noise sources and consistency. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 217–228, 2011.

13. Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography (TCC), pages 265–284, 2006.

14. Dan Feldman, Amos Fiat, Haim Kaplan, and Kobbi Nissim. Private coresets. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), pages 361–370, 2009.

15. Arik Friedman and Assaf Schuster. Data mining with differential privacy. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 493–502, 2010.

16. Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. Boosting the accuracy of differentially private histograms through consistency. Proc. VLDB Endow., 3(1-2):1021–1032, September 2010.

17. Daniel Kifer and Ashwin Machanavajjhala. No free lunch in data privacy. In Timos K. Sellis, Renée J. Miller, Anastasios Kementsietsidis, and Yannis Velegrakis, editors, SIGMOD Conference, pages 193–204. ACM, 2011.

18. Ninghui Li, Wahbeh H. Qardaji, Dong Su, and Jianneng Cao. PrivBasis: Frequent itemset mining with differential privacy. PVLDB, 5(11):1340–1351, 2012.

19. Frank McSherry and Ratul Mahajan. Differentially-private network trace analysis. In Proceedings of the ACM SIGCOMM 2010 Conference, pages 123–134, 2010.

20. Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 94–103, 2007.

21. Noman Mohammed, Rui Chen, Benjamin C. M. Fung, and Philip S. Yu. Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011.

22. Anna Monreale, Gennady L. Andrienko, Natalia V. Andrienko, Fosca Giannotti, Dino Pedreschi, Salvatore Rinzivillo, and Stefan Wrobel. Movement data anonymity through generalization. Transactions on Data Privacy, 3(2):91–121, 2010.

23. Vibhor Rastogi and Suman Nath. Differentially private aggregation of distributed time-series with transformation and encryption. In SIGMOD, pages 735–746, 2010.

24. P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In Proceedings of the IEEE Symposium on Research in Security and Privacy, pages 384–393, 1998.

25. Pierangela Samarati and Latanya Sweeney. Generalizing data to provide anonymity when disclosing information (abstract). In Proceedings of the 17th ACM Symposium on Principles of Database Systems (PODS), 1998.

26. M. Terrovitis and N. Mamoulis. Privacy preservation in the publication of trajectories. In Proceedings of the 9th International Conference on Mobile Data Management (MDM), 2008.

27. Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. Differential privacy via wavelet transforms. IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(8):1200–1214, August 2011.

28. Jia Xu, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, and Ge Yu. Differentially private histogram publication. In ICDE, pages 32–43, 2012.

29. Roman Yarovoy, Francesco Bonchi, Laks V. S. Lakshmanan, and Wendy Hui Wang. Anonymizing moving objects: how to hide a MOB in a crowd? In EDBT, pages 72–83, 2009.