11
Be a Collaborator and a Competitor in Crowdsourcing System Iheb Ben Amor, Mourad Ouziri, Soror Sahri, Naouel Karam To cite this version: Iheb Ben Amor, Mourad Ouziri, Soror Sahri, Naouel Karam. Be a Collaborator and a Com- petitor in Crowdsourcing System. Mascots2014, 2014, pp.12. <hal-01062773> HAL Id: hal-01062773 https://hal.archives-ouvertes.fr/hal-01062773 Submitted on 12 Sep 2014 HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destin´ ee au d´ epˆ ot et ` a la diffusion de documents scientifiques de niveau recherche, publi´ es ou non, ´ emanant des ´ etablissements d’enseignement et de recherche fran¸cais ou ´ etrangers, des laboratoires publics ou priv´ es.

Be a Collaborator and a Competitor in Crowdsourcing System · 2017-01-21 · with a probabilistic reasoning framework. In [6], the authors present an hybrid human machine pipeline,

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Be a Collaborator and a Competitor in Crowdsourcing System · 2017-01-21 · with a probabilistic reasoning framework. In [6], the authors present an hybrid human machine pipeline,

Be a Collaborator and a Competitor in Crowdsourcing

System

Iheb Ben Amor, Mourad Ouziri, Soror Sahri, Naouel Karam

To cite this version:

Iheb Ben Amor, Mourad Ouziri, Soror Sahri, Naouel Karam. Be a Collaborator and a Com-petitor in Crowdsourcing System. Mascots2014, 2014, pp.12. <hal-01062773>

HAL Id: hal-01062773

https://hal.archives-ouvertes.fr/hal-01062773

Submitted on 12 Sep 2014

HAL is a multi-disciplinary open accessarchive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come fromteaching and research institutions in France orabroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, estdestinee au depot et a la diffusion de documentsscientifiques de niveau recherche, publies ou non,emanant des etablissements d’enseignement et derecherche francais ou etrangers, des laboratoirespublics ou prives.

Page 2: Be a Collaborator and a Competitor in Crowdsourcing System · 2017-01-21 · with a probabilistic reasoning framework. In [6], the authors present an hybrid human machine pipeline,

Be a Collaborator and a Competitor inCrowdsourcing System

Iheb Ben AmorUniversite Paris Sorbone Cite

Paris DescartesFrance

Email: [email protected]

Mourad Ouziri and Soror SahriUniversite Paris Sorbone Cite

Paris DescartesFrance

Email: [email protected]

Naouel KaramFreie Universitat Berlin

GermanyEmail: naouel.karam

@fu-berlin.de

Abstract—Crowdsourcing is emerging as a powerful paradigmto solve a wide range of tedious and complex problems invarious enterprise applications. It spawns the issue of findingthe unknown collaborative and competitive group of solvers. Theformation of collaborative team should provide the best solutionand treat that solution as a trade secret avoiding data leakbetween competitive teams due to reward behind the outsourcingof the issue. The formation of effective competitive teams not onlyrequires adequate skills between members of each team, but alsogood team connectivity through social network and to provide thebest solution and treat that solution as a trade secret avoiding dataleak between teams due to reward behind the outsourcing of theissue. In this paper, we propose a data leak aware crowdsourcingsystem called SocialCrowd. We introduce a clustering algorithmthat uses social relationships between crowd workers to discoverall possible teams while avoiding inter-team data leakage.

I. INTRODUCTION

A new trend of teamwork has been emerged unconstrainedby local geography, available skill set, networking anddeep relationships the crowdsourcing. It is the action ofoutsourcing tasks, traditionally performed by an employeeor contractor, to an undefined group of people through anopen call [3]. Crowdsourcing applications should be enableto seek for people crowd on demand to perform a widerange of complex and difficult tasks. Thousands human actorswill provide their skills and capabilities in response to the call.

We introduce a type of crowdsourcing called SocialCrowdbased on the efficiency of social network to outsource a taskto be performed by people on demand instead of an openworld as Mechanical Turk is doing. In fact, the interactionsbetween people involved to answer a query become complexmore and more and the collaboration leads to the emergenceof social relations and a social network can be weaved forhuman-task environment.

The goal of SocialCrowd system is to let people collaborat-ing on a joint task in the crowd environment where they mayseek for other members towards social crowd relationships forachieving a business goal. Thus, many competitive teams canprovide a set of answers to the call. However, as the crowd taskis competitive between teams, it is important to group people ina manner there is no inter-teams leaking. Such mechanism willavoid the information leak between crowd people in different

teams. Hence, the issues and challenges considered in oursystem, include:

(i) how to build up and discover teams answering the querytowards the social relationships.

(ii) how to avoid the solution disclosure of the problemduring the teams construction between competitive groups.

In fact, people on demand collaborating to a task may sharesensitive information (part of the problem solution)that may bepropagated or forwarded to other crowd members in the socialnetwork.

Few works dealing with crowdsourcing are provided. In[7], the Trivia Master system generates a very large Databaseof facts in a variety of topics, cleans it towards a gamemechanism and uses it for question answering. In [14] isproposed a novel approach for integrating human capabilitiesin crowd process flows.

In [10]is introduced an answering queries withCrowdsourcing. The authors explain their challengesand present an initial solution to the problems of findingnew data and comparing data. For this, they propose asimple SQL schema and query extensions that enable theintegration of crowdsourced data and processing, they presentalso the design of croudDB including new crowdsourcedquery operators and plan generation techniques that combinecrowdsourced and traditional query operators. After that, theydescribe methods for automatically generating effective userinterfaces for crowdsourced tasks, and present the result ofmicro-benchmarks of performance of individual crowdsourcedquery operators on the ATM platform and demonstrate thatCrowdDB in indeed able to answer queries that traditionalDBMS cannot.

In [9] designed a wizard that may automatically configurea user’s privacy settings with minimal effort from the userto aim policy preferences learning, to classify users andan automatic access rules affectation to new user. Theydeveloped a tool, an easy to use and provides very simpleuser interactions, it leveraging the machine learning paradigmof active learning, it interactively asks the user to assignprivacy labels to specific, carefully selected, friends. As the

Page 3: Be a Collaborator and a Competitor in Crowdsourcing System · 2017-01-21 · with a probabilistic reasoning framework. In [6], the authors present an hybrid human machine pipeline,

user provides more input, the quality of the classifier improves.

[17] models the relationship strength in on-line socialnetworks, the authors develop an unsupervised model toestimate relationship strength from interaction activity anduser similarity. Them model can represent the full spectrum ofrelationship strength and propose an unsupervised method toinfer a continuous-valued relation-ship strength links since therelationship strength directly impact the nature and frequencyof on-line interactions between a pair of users so impact theprivacy policies of each user.

In [15], the authors introduced privacy protection tool thatmeasures the amount of sensitive information leakage in auser profile and suggest self-sanitization action to regulatethe amount of leakage. The protection tool determines theprobability of individual sensitive attribute inference in auser profile based on her/his friendship relations and friend’sattribute values.

In [4] has concluded the need of more flexible mechanismsof protection, making a user able to decide which networkparticipants are authorized to access his/her resources andpersonal information to enforce privacy policies. Hence,they focused their work on relationship protection byproposing a strategy exploiting cryptographic techniques toenforce a selective dissemination of information concerningrelationships across a social networks.

In [18] the authors introduce the problem of image search.Even when we have to deal with variations in lighting,texture, image quality, the search should be precise andreturn very few erroneous results. They wish to return onevalidate image before deadline, while minimizing the money.To do this, they propose a solution named CrowdSearch, inorder to provide an accurate image search system for mobiledevices by combining an automated Image search and thena human validation based on crowdsourcing that increasessearch result accuracy. The combination presents a complexset of trade-offs involving delay, accuracy and cost.

In [16] they are interested in the problems of finding allduplicate records. They propose an hybrid human machinesystem for entity resolution. They uses machine basedtechniques to discard those pair of records that look verydissimilar and only asks the world to verify the remainingpairs which need human insight. They combine the efficiencyof the machine based approaches with the answer qualityobtained from people. They uses heuristics approach to fightagainst the problem of processing time.

In [5] are interested in the problems of linking entity fromtext to the linked open data cloud. They linked definitionsto words in texts. To do this, they propose the ZenCrowdallowing linking definitions to words in texts. They combineboth algorithmic and manual linking, automate manual linkingvia crowdsourcing and dynamically assess human workerswith a probabilistic reasoning framework.

In [6], the authors present an hybrid human machinepipeline, a novel system used for answering complex keywordqueries. Using the crowd to understand the query (gain knowl-edge of query structure and entity relationships). They developa new language to answer the queries.

[2] develops how to exploit the unique capabilities of thecrowd to mine data from the crowd. They define a method ofaccessing personal databases based on association rules. Thedefinitions of support and confidence are used as a measurefor the significance of a rule per user.

The work in [8] is based on the social networks wherethe proposed systems searches the most suitable worker toanswer the task. It is proposed a model based on the workers’profiles using information on social network platforms. Theauthors developed a native Facebook application named OpenTurk implementing tasks assignment.

In [13] it is introduced the problem of the InteractiveCrowdsourcing, they present the smartcrowd a framework forharnessing the crowd to approximate ground truth effectivelyand efficiently, while taking into account the innate uncer-tainty of human behavior. They formulate task assignment asan optimization problem, and rely on pre-indexing workersand maintain the indexes adaptively, in such a way that thetask assignment process gets optimized both qualitatively,and computation time-wise. They present rigorous theoreticalanalyses of the optimization problem and propose optimal andapproximation algorithms.

Privacy and data leaking are not at all discussed in theseworks.

Contributions:

We address the aforementioned challenges by proposingthe SocialCrowd framework to discover competitive andcollaborative composition teams answering the query througha social network while prohibit data leak between those teams.

The discovering method (DLTD) is a clustering algorithmto classify the potential crowd people from the social networkthat are close to collaborate in the same team. The systemwill not group them in the competitive clusters, thus, avoidinter-teams data leaking.

To handle such leak, the HDPD algorithm, a Markov chain-based algorithm, is proposed to discover the implicit socialrelationships between crowd people.

The rest of the paper is organized as follows: section 2 pro-vides an overview of our SocialCrowd system. The propagationprocess is described in section 3. In section 4 is discussed theclustering based approach to discover competitive clusters inthe social network answering a business call. We discuss ourexperiments in Section V.

II. SocialCrowd: AN OVERVIEW

The SocialCrowd architecture is composed of three maincomponents as depicted in Fig.1:

Page 4: Be a Collaborator and a Competitor in Crowdsourcing System · 2017-01-21 · with a probabilistic reasoning framework. In [6], the authors present an hybrid human machine pipeline,

Data Propagation Process(DPP)

Social network

ClusteringProcess (CP)

TeamsConstitution

Data leakaware

Identitied Teams

Clu

sterin

g

QueryClustering model

Propagation model

Direct Propagation

data

Identitied Members

UserInterface

QueryRegistration

Direct and Implicit relation matrix

Data leak aware

Data leak Background

Fig. 1. SocialCrowd architecture

• Data propagation component:

In order to compose data leak free teams, the datapropagation component computes the propagation ofdata in the social network. The component uses thecollected (inputs) and shared (outputs) information ofeach social member. The proposed data propagationalgorithm HDPD (High Data Propagation Discovery)deals with the dynamic aspects of the network.

It first computes the probabilities of data propagationfrom each member to direct friends in the socialnetwork. Then, it calculates data propagation prob-abilities from each member to indirect friends.It is abased Markov Chain model detailed in next section.

• Clustering component:

Given the data propagation probabilities provided inthe previous step, the clustering component groupscrowd workers1 into competing clusters (or teams)free of data leak. For that purpose, we define adistance to capture data leaks.

Crowd workers with higher propagation probabilityare part of the same cluster. The proposed clusteringalgorithm DLTD (Data Leak free Team Discovery)returns all potential data leak free teams. Details aboutthe DLTD algorithm are given in Section IV.

• Team Constitution:

The module will constitute the different team based onthe provided result from the clustering process (CP).It will use the user profile information provided fromthe social network. It will notify the users about theteam discovering results.

1In the rest of the paper, we use the terms ”member” and ”crowd worker”interchangeably.

III. A MARKOV CHAIN-BASED APPROACH FOR DATAPROPAGATION

The aim of our approach is to build up teams fromthe social networks, all the while avoiding propagation ofdata between competitive clusters. For that, the clusteringalgorithm needs all the possible interactions between membersof a social network.

A Social Network is set of direct relationships betweenmembers. These direct relationships allow us to computethe probability of data propagation (and consequently anydata leak) between only direct friends. However, to discovercompetitive teams that are data leak aware, one needs toknow all possible data propagations from friend-to-friend inthe whole social network.

The Markov chain model is used, because

• in real world the data propagation depends only of thepresent state and not the past one.

• The discovering of the maximum data propagationbetween members is probabilistic.

The data share is an imminent information to identifycompetitive teams (by prohibiting leakage between discoveredclusters) and collaborative teams (by ensuring that themembers in the cluster interact well).

In real social network, data are shared from friend tofriend and so on, so the friendship relation is a good indicatorof data share in our case.

As example, the social network of Fig. 2 shows that Aliceshares 90% of his data with Mickael. This rate is considered asprobability of data propagations between direct friends notedby p(Alice,Mickael) = 0.9. However, the data propagationto indirect friends (to friends-of-friends) are implicit. Theydepend on data propagation probabilities between directfriends. In the Fig. 2, the probability of data propagation fromAlice to indirect member David is computed by the expression:

p(Alice,David) = p(Alice,Mickael) × p(Mickael,David) =0.9× 0.5 = 0.45.

All the possible paths in the social network have to be exploredto compute data propagation between two members. This is ahard computing task as real social networks are quite complex.

In this section, we present a Markov chain-based approachfor handling the implicit/indirect relations between members.We present a model and an algorithm of data propagationacross the entire social network.

A. A graph-based model of data sharing relationships

In this paper, we model the relationships in the socialnetwork without focusing on their nature (e.g., friendship,sharing, etc.). We only need to reason on the quantity ofdata propagated between members in the social network. Thus,

Page 5: Be a Collaborator and a Competitor in Crowdsourcing System · 2017-01-21 · with a probabilistic reasoning framework. In [6], the authors present an hybrid human machine pipeline,

Fig. 2. A simple data propagation

hereafter we denote all kinds of relationships by the friendshiprelation.

We model the network as a labeled directed graphG 〈M,A,P 〉 where :

• M = mi: set of nodes where each node representsa social network member.

• A = aij = (mi,mj)/(mi,mj) ∈M: set of edgeswhere each edge represents a relationship between twomembers.

• P = pij/∀i, j pij ∈ [0, 1]: is a set of labelswhere each label pij of the edge aij represents therate/probability of data shared by the member mi ∈Mwith his friend member mj ∈M .

In the given graph model, the friend relationship is repre-sented using edge A labeled with the real probability of shareddata P . We calculate the probability pij that the member mi

shares his data with member mj by the simple formula:

pij =quantity of data that mi shared with mj

quantity of data held by mi

where we consider the data held by the member mi as onlythe data that mi received from other direct members in the so-cial network. For instance in Fig. 2, the arc (Alice,Mickael)indicates that Alice has a relationship with Mickael, andshares with him 90% of his data and the arc (Alice,George)indicates that Alice has a relationship with George but shenever shares any data with him.

In the following, we present the Markov propagation modelto compute the implicit probability of data propagation be-tween indirect friends (e.g., the propagation probability fromAlice to David in Fig. 2 ).

B. Markov chain-based model for data propagation

Given an owned data of a known member, the aim of thissection is to compute the propagation probabilities to all thepossible members of the social network. The premise is that insocial networks data is propagated from one friend to anotherand so on. This observation about propagation of owned datafrom friend to friend corresponds to the well known Markovchain [11].

Definition 1: (Markov chain)

A Markov chain is a sequence of random variablesX1, ..., Xn with the Markov property; that the future statedepends only on the the present state, and not on the paststates. Formally,

P (Xn+1 = x|X1 = x1, X2 = x2, . . . , Xn = xn) =P (Xn+1 = x|Xn = xn)

From this formal definition, the probability that a givenmember gets some data depends on the probability to get itfrom only his direct friends (and not from indirect friends).

We represent the probability of data propagation betweendirect friends as a Propagation Matrix (defined as follows).

Definition 2: (Propagation matrix)

The Propagation Matrix of a social network G 〈M,A,P 〉is a matrix that gives the probability pij of propagating databetween each couple of friend members (mi,mj), where

pij =

quantity of data that mi shared with mj

quantity of data held by miif (mi,mj) ∈ A

1 for i = j0 else

This propagation matrix has the following properties:

• pii = 1 means that member mi does not lose theowned due to sharing.

•∑k∈[1,n] (pik) 6= 1, as data may be propagated to

several members at the same time.

• pij 6= pji, which means that a member mi may sharedata with a friend mj in a quantity that is differentfrom what his friend mj may share with him.

• (mi,mj) ∈ A; pij 6= 0, which means that membersdo not necessarily share their data with their friends.

• (mi,mj) /∈ A ⇒ pij = 0, to express that datapropagation to indirect friends in the social networkis unknown.

Based on the above mentioned definitions, the propagationmatrix of the social network of Fig. 2 is given as follows:

Page 6: Be a Collaborator and a Competitor in Crowdsourcing System · 2017-01-21 · with a probabilistic reasoning framework. In [6], the authors present an hybrid human machine pipeline,

Bob Alice John David Mickael George

Bob 1 0 0 0.1 0.1 0.7Alice 0 1 0 0 0.9 0John 0 0 1 0.2 0 0David 0 0 0.2 1 0, 5 0Mickael 0 0.9 0.3 0.5 1 0.2George 0.7 0.3 0.1 0 0 1

C. A Markov chain-based algorithm of data propagation

The propagation matrix of Section III-A gives only theprobability of data propagation between direct friends. Butthe matrix is not sufficient to compute the probability thatdata is propagated from one member to other indirect-friend(s)because:

1) The propagation probability between direct friendsis not optimal. Following different paths in the graphof the social network, we can calculate differentprobabilities of propagating data between twomembers.

In Fig. 2, the direct friend relationship from Alice toGeorge indicates that Alice’s data is not propagatedto George as p(Alice,George) = 0. However, throughMickael, Alice’s data may be propagated to Georgewith probability 0.9× 0.2 = 0.18.

2) Propagation to indirect-friends is not given in thepropagation matrix and is set to zero. The probabilityof sharing data with indirect friends is zero in thepropagation matrix because members share theirdata only with direct friends. However, data may bepropagated from direct friends to indirect-friends.

In the propagation matrix corresponding to Fig. 2,probability that George’s data will be propagatedto indirect-friend David is zero because George isnot a direct friend. However, George’s data may bepropagated to David through Bob with probability0.7× 0.1 = 0.07.

3) Computing propagation to indirect-friends is hard.The probability that Alice’s data may be propagatedto David (probability of dotted red arrow in Fig. 2)is hard to compute because it requires exploring allthe propagation paths from Alice to David (dottedgreen arrows in Fig. 2 - although the graph of Fig.2 is small, there are seven paths).

All the possible paths in the social network have tobe explored to compute data propagation between twomembers. This process is not straight-forward as realsocial networks are quite complex.

In this regard, we propose an algorithm to calculate theoptimal probability of data propagation from a given memberto all the members of the social network. This algorithm isbased on an energy function we define as follows:

Definition 3: (Energy function)

The energy function pi is the probability that data ispropagated to the member mi. Since in our model data ispropagated following a Markov chain:

pi = Maxmk∈Nmi

(pk × pki) (1)

where,Nmi

is a set of direct friends of mi, pk is the energy functionof mk and pki is the probability of propagating data from mk

to mi.

The maximum function is the suitable aggregation functionas the aim of the approach is to estimate the maximum risk ofdata propagation. The aim is to compute the data probabilitypropagation pij and pji of all the pairs of members (mi,mj).This requires the use of an iterative algorithm. The pro-posed HDPD algorithm computes the optimal energy functionsPo∗ = (po1, po2, ..., pon), which represent probabilities of datapropagation from the member mo to members mii∈[1,n]. Itprocesses as follows:

• Initialisation:powner = 1,∀i 6= owner pi = 0. That isonly the owner has the data.

• Iterations: At each iteration, the algorithm computespi for mi ∈ Nmiusing the energy function.

• Stops: The algorithm stops when the maximum prob-ability of each member is reached.

The algorithm is recursive, the complexity of the algorithm isthe main call. Given, n the number of registered members, mthe number of relations and f(n) the main call propagationfunction complexity. This function is composed of a test ofcomplexity O (n) and it makes n recursive call. Then, thecomplexity of this function is : O(n ∗m)× n.

Our algorithm is presented as follows:

Page 7: Be a Collaborator and a Competitor in Crowdsourcing System · 2017-01-21 · with a probabilistic reasoning framework. In [6], the authors present an hybrid human machine pipeline,

Algorithm 1 HIGH DATA PROPAGATION DISCOVERY(HDPD)Require: G 〈M,A,PM〉 – labeled directed graph of the so-

cial network where PM is the propagation matrixmo – owner of the data

Ensure: P – all the implicit data propagation from mo

1: P = (p1, . . . , po, . . . , pn) – Energy function at the currentstep.

2: PS = (ps1, . . . , pso, . . . , psn) – Energy function at thenext step.

3: continue: boolean value indicating if the optimal proba-bilities are reached.Initializations

4: po = 1 and ∀i 6= o, pi = 05: continue← trueIterations

6: while continue do7: for each members mi 6= mo do8: psi = Maxmk∈Nmi

(pk × pki))9: end for

10: if P 6= PS then11: pi ←Max (psi, pi)12: else13: continue← false14: end if15: end while16: return P

Running the algorithm on the propagation matrix of previ-ous subsection III-B results in the following graph (but onlysome links are shown for readability):

Fig. 3. Data propagation graph with implicit relationships

This graph shows that:

• For indirect friend relationships such as(Alice,David) and (Alice, John), the algorithm

calculated data propagation probabilities that do notexist in the initial graph of Fig. 2.

• For some direct friend relationships such as(Alice,George) and (Bob,Mickael), the algorithmupdates the data propagation probabilities from 0and 0.1 to the optimal values 0.18 and 0.19 re-spectively. These optimal probabilities are calcu-lated using the paths (Alice,Mickael,George) and(Bob,George,Alice,Mickael), respectively.

IV. DATA LEAK AWARE CLUSTERING

In this section, we provide details on the clustering processthat groups crowd members into data leak free clusters, basedon the data propagation technique defined in the previoussection. In order, to give chance to every registered memberin the competition, the clustering process discover all possiblesolutions including each member only to one cluster. Incontrast to existing clustering algorithms (such as k-means[12] and our previous presented algorithm [1]) that generate asingle clustering solution, our proposed approach generates allpossible clustering solutions while preserving their respectivedata.

Before presenting the algorithm details, we give the fol-lowing definitions:

Definition 4: (Cluster)

A cluster C is set of crowd members (having propagation,or no strong propagation):

C = mi such that ∀mi,mj ∈ C, pij ∈ [0, 1], pji ∈ [0, 1]

Definition 5: (Data leak free clusters)

Two clusters Ck and Cs are data leak free iff:

∀mi ∈ Ck,∀mj ∈ Cs, pij ≤ η ∧ pji ≤ ηIn definition 5, we consider there is a risk of data leak

between two clusters Ck and Cs if the propagation ratebetween all the members of the two clusters is less thana threshold η. The clustering algorithm uses the followingdefinition (of distance) to generate data leak free clusters:

Definition 6: (Distance)

The data propagation between a member mi and a clusterCk is measured by the distance Ω(Ck,mi), defined as follows:

Ω(Ck,mi) = Maxmj∈Ck

pij

It follows from definitions 5 and 6 that there is a data leakfrom a cluster Ck to member mi if: Ω(Ck,mi) > η.

Definition 7: (Clustering solution)

A solution S = C1, ..., Cnn≥1 is set of clusters suchthat:

∀i, j ∈ [1, n] Ci and Cj are data leak free.

Page 8: Be a Collaborator and a Competitor in Crowdsourcing System · 2017-01-21 · with a probabilistic reasoning framework. In [6], the authors present an hybrid human machine pipeline,

Note that a solution is incomplete if it does not contain all themembers, and complete otherwise.

Using definition 5, we transform the propagation matrix,computed by the HDPD algorithm, into a symmetric matrix asfollows:

∀i, j, pij = Max(pij , pji)

The propagation matrix will be modified such as consid-ering now an undirected graph and update the relationshipsbetween two users with the high value of their propagation.The maximum will be the criteria used to avoid data leakingbetween clusters.

As mentioned earlier, the Data Leak free Team Discoveryalgorithm (DLTD) will generate all the possible clustering so-lutions based on the defined symmetric matrix. The processesare as follows:

Initialization:

• A random member m0 is selected, and made themember of a new cluster C1. That is, C1 = m0

• The initial partial solution is created as S1 = C1.

Using the partial solution(s) Si = C1, ..., Cnn≥1 (i is thenumber of clustered members in Si), the algorithm clusters anew member mi according to the following rules:

Rule 1: If mi has data leak with only one cluster Ck inSi (i.e., ∃Ck ∈ SiΩ(Ck,mi) > η and ∀Cjj 6=k ∈Si,Ω(Cj ,mi) ≤ η), then the member mi is clusteredinto Ck. The partial solution Si is increased to Si+1

such that Si+1 = Si where Ck = Ck ∪ mi.Rule 2: If mi has no data leak with any cluster in Si

(i.e., ∀Cj ∈ Si,Ω(Cj ,mi) > η), then the memberis clustered in each cluster while creating a newsolution for each instance.

That is, Si increases to n possible partial solutionsSαi+1: Sαi+1 = Si ∪ Cα

where Cα = Cα ∪ mi for α ∈ [1, n].

Rule 3: If mi has data leak with two or more clusters in Si,i.e., for SCi in Si, SCi = C1, ...Ck2≤k≤n suchthat ∀Cj ∈ SCi,Ω(Cj ,mi) > η, then there are twopossible solutions (S1

i+1 and S2i+1):

(1)Merge the clusters C1, ...Ck and mi into onecluster

⋃i=1,k

Ci ∪ mi and The partial solution Si is

increased to S1i+1

(2) Compute the subset L ⊆⋃

Ci∈SCi

Ci such that

members of L∪mi have direct or indirect data leakbetween them and Create a new cluster Cg = L∪miand remove members of L from their respective clus-ter finally the partial solution Si increased to S2

i+1

S2i+1 = Si ∪ Cg.

The DLTD clustering algorithm applies the above men-tioned defined rules until all members are clustered in

all possible solutions. The algorithm is traced as fol-lows:

Algorithm 2 DATA LEAK FREE TEAM DISCOVERY

Input: The symmetric propagation matrix, M : set of all crowdmembers to be clustered.Output: All possible solutions Si of data free clustering.Initializationsrandomly select m0 from M;create the cluster C1 = m0;S1 = C1 call FreeLeakClustering (S1,M −m0); Clusteringfunction Function FreeLeakClustering(Solution Si, MembersM);if M = ∅ then

Return Si;\Si is a final solution as there are no remaining membersto cluster

end ifif Rule1: mi has data leak with only one cluster Ck in Si thenCk = Ck ∪mi

Sji+1 = Sicall FreeLeakClustering(Si+1,M −mi);

end ifif Rule2: mi has no data leak with any cluster in Si then

for each cluster Cj in Si doCj = Cj ∪mi

Sji+1 = Sicall FreeLeakClustering(Sji+1,M −mi);

end forend ifif Rule3: mi has data leak with two or more cluster SCi inSi, SCi = C1, . . . , Ck2≤k≤n ⊆ Si then\ First alternative solution Merge all the clusters of SCiwith mi into the new cluster Cg . that is,Cg = C1 ∪ . . . ∪ Ck ∪mi;call FreeLeakClustering(S1

i+1,M −mi);\ Second alternative solution compute the subset L ⊆⋃Ci∈ SCi Ci of members having direct or indirect data

leak between them ’see Rule3.2);Create a new cluster Cg = L

⋃mi;

for mj in L doDelete mj from its cluster in Si;

end forS2i+1 = Si

⋃Cg

call FreeLeakClustering(S2i+1,M −mi);

end if

As example, Fig. 4 shows the resolution graph generatedby DLTD clustering algorithm when running on the graph ofFig. 2. Internal nodes (S∗i , i ≤ 5) are partial solutions and leafnodes (S∗6 ) are final solutions as all members are clustered(there are 6 members to be clustered for this example).

The graph shows five solutions of clustering. In eachsolution, members of the social network are grouped intodata leak free clusters, which can be easily checked usingsymmetric propagation matrix given at the beginning of thissection.

Page 9: Be a Collaborator and a Competitor in Crowdsourcing System · 2017-01-21 · with a probabilistic reasoning framework. In [6], the authors present an hybrid human machine pipeline,

Fig. 4. Resolution graph generated by the clustering algorithm

The algorithm is recursive and consists of three mainrules: Rule 1, Rule 2 and Rule 3. The complexity of thealgorithm is the main call. Given, n the number of membersto be classified, and f(n) the main call clustering functioncomplexity.

Then, f(n) = Max(O(Rule1(n)), O(Rules2(n)), O(Rule3(n)).

1) Complexity of rule 1 : This rule is composed of atest of complexity O(n), and a recursive call.

Then, the complexity of this rule is :O(n) +O(f(n−1)) = n+ (n−1) +O(f(n−2)) =n+ (n− 1) + . . .+O(1) = O(n2).

2) Complexity of rule 2 : This rule makes k recursivecall where k is the number of clusters in a solution.

Then, the complexity of this rule is :O(n)+k×O(f(n−1)) = n+k×(k−1)× . . .×1 =O(n+ k!).

If we consider the theoretical worst case k = n, thencomplexity of rule 2 is O(n!).

3) Complexity of rule 3 : This rule is composed of atest of complexity O (n), and two recursive calls.

Then, the complexity of this rule is:O(n)+2×O(f(n−1)) = n+2(n−1)+O(f(n−2)) =n+ 2(n− 1) + . . .+ 2 = O(n2).

Hence, the overall complexity of the algorithm is O(n!). Thisis a theoretical complexity in the worst case. In practice, thenumber of clusters k are usually much smaller than n.

V. EXPERIMENTS

The experiment environment consists of a Mac OS X10.5.8 machine with a 2.13 GHz Intel Core 2 Duo processor,and 2 GB of ram.

Coding is JAVA-based, and we have followed Metcalfe’slaw 2 for creating the social network.

Metcalfe’s law characterizes many of the network effectsof communication technologies and networks such as theInternet, social networking, and the World Wide Web. FormerChairman of the U.S. Federal Communications Commission,Reed Hundt, said that this law gives the most understandingto the workings of the Internet.

Metcalfe’s Law is related to the fact that the number ofunique connections in a network of a number of nodes (n)can be expressed mathematically as the triangular number n(n- 1)/2, which is proportional to n2 asymptotically.

The law is abundant and existent due to the ability ofInternet users to link together. Websites and blogs such asTwitter, Facebook, and Myspace are the center of this lawtaking effect.

We conduct experiments on effectiveness of the proposedapproach in terms of data leak comparing to K-means.K-means clustering is a well known partitioning method. Inthis objects are classified as belonging to one of K-groups.

The results of partitioning method are a set of K clusters,each object of data set belonging to one cluster. In each clusterthere may be a centroid or a cluster representative.

A. k-means and DLTD clustering based on Ω

We present the evaluation of the proposed DLTD algorithmand compare it to the k-means clustering algorithm, based onthe Ω distance.

The data leakage is calculated based on the results returnedby the two algorithms. We explored each cluster in orderto identify the data propagation upper than η between two

2Bob Metcalfe, personal communication, June 2007.

Page 10: Be a Collaborator and a Competitor in Crowdsourcing System · 2017-01-21 · with a probabilistic reasoning framework. In [6], the authors present an hybrid human machine pipeline,

members in different clusters. To do this, we developed a scriptcomputing the leakage after the classification step.

The results are depicted in Fig. 5 (where the leakagethreshold (η) is set to 40%). The figure shows that with theDLTD algorithm, the clusters are free data leak, even if, thesize of the social network increases. These results are soundbecause the algorithm groups members leading to the dataleak in the same group. In contrast, the k-means algorithmresults in a higher percentage of data leaks with increasingsocial network size. This result is valid with varying numberof clusters (k).

For instance, with 20 clusters we can see that data leakincreases from 2.9% to 4.4% when the size of the socialnetwork grows from 500 to 5000 members. These resultsconfirm that the k-means algorithm is not a good solution forhandling data leaks in social networks.

In fact, even if there is high data propagation between twonetwork members, the k-means algorithm may cluster theminto different clusters due to their respective closeness in termsof the used distance.

500 1000 1500 2000 2500 3000 3500 4000 4500 5000

7

6

5

4

3

2

0

Members

Pe

rce

nta

ge o

f d

ata

lea

k

Fig. 5. Leakage comparison

B. Execution time comparison

We analysed the execution time of both DLTD en K-means algorithm using Ω distance. The analysis uses between1000 and 5000 users. We computed the execution time ofaround 12497500 relationships in social network. We notethat, the DLTD algorithm has no data leakage, however, itscomputational time is upper than the K-means.

In the figure 6, the execution time is between 153 and 1653seconds for the k-means algorithm and it’s between 172 and2631 for the DLTD. Then, the K-means algorithm is betterthan the DLTD in computational time.

Fig. 6. DLTD and K-means execution time

The execution time of the DLTD algorithm is importantthan the K-means, because, the DLTD discover all possibleclustering solutions, or the K-means discover only one.

Furthermore, the k-means results represent an importantdata leakage (data propagation upper than η) betweenthe different clusters while using Ω distance, because, theclassification mechanism group members in the nearest cluster.

This means that, k-means classify the member to thecluster with which he has the highest data propagation, evenif, he has other propagations upper than the threshold withother clusters.

The execution time of the k-means algorithm is causedby the convergence process. The k-means repeating thesame classification phase until convergence. The number ofrepetition is variable, in our case, it depends on the numberof members and their propagation to classify data.

The convergence of k-means is reached when comparingthe last two iterations shows that members no longer movebetween teams, then, the calculation of the k-means reachesits stability and no iteration is required. In addition, theconvergence to a local minimum can produce a bad result.

The convergence criterion of DLTD algorithm is reachedwhen there is no crowd member to classify.

C. Solution generation scalability

In this experiment, we shows that the DLTD algorithmreturns different solutions generation depending on thenumber of registered member and the data propagation valuesand not depending on social network size.

The figure displays that more the of registered membergrows more the number of generated solutions increasesexcept from 1500 to 2000 members, because, members thathas a data propagation upper than the threshold with two or

Page 11: Be a Collaborator and a Competitor in Crowdsourcing System · 2017-01-21 · with a probabilistic reasoning framework. In [6], the authors present an hybrid human machine pipeline,

more clusters generates merging those clusters.

Fig. 7. Solution generation scalability

Hence, there will be less solutions. Using the same socialnetwork architecture and evolving the number of registeredmember, the fig 7 shows that, with 500 registered member, theDLTD algorithm return around 33 possible solutions, and for3000 members the algorithm generate around 125 solutions.

VI. CONCLUSION AND FUTURE WORK

In this paper, we proposed a data leak aware crowdsourcingsystem providing a collaborative environment between crowd-experts organized in various teams that compete against eachother to achieve certain tasks. The system prevents data leaksbetween teams to protect each team competitive advantage.

For that, we designed a clustering algorithm that discoversall possible teams of crowd workers while minimizing dataleakage. Where the leakage is expressed by propagation ofdata between crowd workers via social relationships.

However, the theoretical complexity of the clustering algo-rithm is hard. To face with this complexity, we plan to improvealgorithm by adding heuristics.

REFERENCES

[1] Iheb Ben Amor, Salima Benbernou, Mourad Ouziri, Mohamed Nadif,and Athman Bouguettaya. Data leak aware crowdsourcing in socialnetwork. In WISE Workshops, pages 226–236, 2012.

[2] Yael Amsterdamer, Yael Grossman, Tova Milo, and Pierre Senellart.Crowd mining. In Proceedings of the 2013 ACM SIGMOD InternationalConference on Management of Data, SIGMOD ’13, pages 241–252,New York, NY, USA, 2013. ACM.

[3] Daren C. Brabham. Crowdsourcing as a model for problem solving:An introduction and cases. Convergence: The International Journal ofResearch into New Media Technologies, 14(1):75–90, 2008.

[4] Barbara Carminati, Elena Ferrari, and Andrea Perego. Enforcing accesscontrol in web-based social networks. ACM Trans. Inf. Syst. Secur.,13(1), 2009.

[5] Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudre-Mauroux. Zencrowd: leveraging probabilistic reasoning and crowd-sourcing techniques for large-scale entity linking. In WWW, pages469–478, 2012.

[6] Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael J.Franklin. Crowdq: Crowdsourced query understanding. In CIDR, 2013.

[7] Daniel Deutch, Ohad Greenshpan, Boris Kostenko, and Tova Milo.Using markov chain monte carlo to play trivia. In ICDE, pages 1308–1311, 2011.

[8] Djellel Eddine Difallah, Gianluca Demartini, and Philippe Cudre-Mauroux. Pick-a-crowd: tell me what you like, and i’ll tell you whatto do. In WWW, pages 367–374, 2013.

[9] Lujun Fang, Heedo Kim, Kristen LeFevre, and Aaron Tami. A privacyrecommendation wizard for users of social networking sites. In ACMConference on Computer and Communications Security, pages 630–632, 2010.

[10] Michael J. Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh,and Reynold Xin. Crowddb: answering queries with crowdsourcing.In Timos K. Sellis, Renee J. Miller, Anastasios Kementsietsidis, andYannis Velegrakis, editors, SIGMOD Conference, pages 61–72. ACM,2011.

[11] Olle Hggstrm. Finite markov chains and algorithmic applications. Inin London Mathematical Society Student Texts. Cambridge UniversityPress, 2000.

[12] J. MacQueen et al. Some methods for classification and analysis ofmultivariate observations. In Proceedings of the fifth Berkeley sympo-sium on mathematical statistics and probability, volume 1, page 14.California, USA, 1967.

[13] Senjuti Basu Roy, Ioanna Lykourentzou, Saravanan Thirumuruganathan,Sihem Amer-Yahia, and Gautam Das. Optimization in knowledge-intensive crowdsourcing. CoRR, abs/1401.1302, 2014.

[14] Florian Skopik, Daniel Schall, Harald Psaier, Martin Treiber, andSchahram Dustdar. Towards social crowd environments using service-oriented architectures. it - Information Technology, 53(3):108–116,2011.

[15] Nilothpal Talukder, Mourad Ouzzani, Ahmed K. Elmagarmid, HazemElmeleegy, and Mohamed Yakout. Privometer: Privacy protection insocial networks. In ICDE Workshops, pages 266–269, 2010.

[16] Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng.Crowder: Crowdsourcing entity resolution. Proc. VLDB Endow.,5(11):1483–1494, July 2012.

[17] Rongjing Xiang, Jennifer Neville, and Monica Rogati. Modelingrelationship strength in online social networks. In WWW, pages 981–990, 2010.

[18] Tingxin Yan, Vikas Kumar, and Deepak Ganesan. Crowdsearch: Ex-ploiting crowds for accurate real-time image search on mobile phones.In Proceedings of the 8th International Conference on Mobile Systems,Applications, and Services, MobiSys ’10, pages 77–90, New York, NY,USA, 2010. ACM.