


Distrib Parallel Databases (2014) 32:447–464
DOI 10.1007/s10619-013-7137-3

Set similarity join on massive probabilistic data using MapReduce

Youzhong Ma · Xiaofeng Meng

Published online: 3 December 2013
© Springer Science+Business Media New York 2013

Abstract In this paper, we focus on set similarity join on massive probabilistic data using MapReduce; so far there has been no effective approach that can process this problem efficiently. MapReduce is a popular paradigm that can process large volumes of data efficiently, and we propose two MapReduce-based approaches for this task: Hadoop Join by Map Side Pruning and Hadoop Join by Reduce Side Pruning. Hadoop Join by Map Side Pruning uses the sum of the existence probabilities to filter out, directly at the Map task side, the probabilistic sets that have no chance of being similar to any other probabilistic set. Hadoop Join by Reduce Side Pruning uses a probability sum based pruning principle and a probability upper bound based pruning principle to reduce the candidate pairs at the Reduce task side, which saves comparison cost. Based on the above approaches, we propose a hybrid solution that employs both Map-side and Reduce-side pruning. Finally, we implemented the above approaches on Hadoop-0.20.2, performed comprehensive experiments on their performance, and measured the speedup ratio over the naive method, Block Nested Loop Join. The experimental results show that our approaches perform much better than Block Nested Loop Join and also have good scalability. To the best of our knowledge, this is the first work that deals with set similarity join on massive probabilistic data using the MapReduce paradigm, and the approaches proposed in this paper provide a new way to process massive probabilistic data.

Keywords Set similarity join · MapReduce · Probabilistic data

Communicated by Feifei Li and Suman Nath.

Y. Ma (B) · X. Meng
School of Information, Renmin University of China, Beijing, China
e-mail: [email protected]

X. Meng
e-mail: [email protected]


1 Introduction

Set similarity join plays an important role in many real-world applications that require finding similar pairs of records containing string or set-based data, such as near-duplicate web page detection [14], data integration [11], document clustering [7], and so on. For example, in data integration, similar author names or similar paper titles can be detected and merged based on the set similarity of their tokens. In fact, we have to deal with uncertain data during the integration procedure. There are two reasons for uncertainty [11]: one is that the data is often extracted from unstructured or semi-structured sources (e.g., HTML pages, XML pages or emails) using automatic or semi-automatic methods; the other is that the data may come from different sources that are unreliable or not up to date.

Set similarity join on probabilistic data is becoming more and more challenging: on the one hand it is complex and computation intensive; on the other hand, in many applications the volume of the data is so large that the join cannot be finished on one machine. For such data- and computation-intensive applications, MapReduce [10] has received more and more attention as a powerful framework that can deal with large-scale data efficiently. In this paper, we use the MapReduce paradigm to deal with the set similarity join problem on large-scale probabilistic data. Based on a detailed analysis of the specific features of probabilistic sets, we propose some novel methods to save network communication cost and CPU cost: a combined prefix filtering principle to reduce the candidate pairs, and a "prefix of prefix" technique to avoid duplicated comparisons of the same set pair. In summary, the contributions of the paper are as follows:

– We propose three approaches: Hadoop Join by Map Side Pruning, Hadoop Join by Reduce Side Pruning and Hadoop Join by Hybrid Pruning, to deal with set similarity join on massive probabilistic data efficiently using the MapReduce paradigm;

– We propose a combined prefix filtering principle to reduce the candidate probabilistic sets and save network communication cost;

– We perform comprehensive experiments to test the performance of the proposed approaches; the experimental results show that the performance of our solutions is much better than that of Block Nested Loop Join.

The rest of the paper is organized as follows. In Sect. 2 we introduce the related work. We give the problem definition in Sect. 3. We then describe the baseline method in Sect. 4. In Sect. 5, we introduce the probability based prefix-filtering principle and propose three new approaches. We present comprehensive experiments in Sect. 6, and Sect. 7 concludes this paper.

2 Related work

In this section we introduce the related work, covering an overview of the MapReduce framework, set similarity join on certain data, and set similarity join on uncertain data.


Fig. 1 Overview of MapReduce

2.1 MapReduce overview

MapReduce [10] was first proposed by Google in 2004 and has since become a very popular and powerful framework for large-scale data analytics. The MapReduce paradigm is very simple: a MapReduce job contains only two tasks, map and reduce. The Map task is responsible for dealing with each input key-value pair (k1, v1) and emits a list of intermediate key-value pairs list(k2, v2). The Reduce task takes a new key-value pair (k2, list(v2)) as input and produces another list of key-value pairs list(k3, v3). The input and output records can be stored in a distributed file system, such as the Google File System [13]. The overview of the MapReduce architecture is illustrated in Fig. 1 [20].
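To make the contract concrete, here is a minimal, self-contained sketch in plain Java (our own illustration for a word-count style job, not code from the paper or from Hadoop itself) that runs records through the map, simulated shuffle, and reduce stages:

```java
import java.util.*;
import java.util.function.BiConsumer;

// Illustration of the MapReduce contract:
// map: (k1, v1) -> list(k2, v2);  reduce: (k2, list(v2)) -> list(k3, v3).
public class MapReduceSketch {
    // Map task: one input record (lineNo, line) emits a (token, 1) pair per token.
    static void map(Long k1, String v1, BiConsumer<String, Integer> emit) {
        for (String token : v1.split("\\s+")) emit.accept(token, 1);
    }

    // Reduce task: all values grouped under one key are aggregated into one count.
    static void reduce(String k2, List<Integer> v2s, BiConsumer<String, Integer> emit) {
        emit.accept(k2, v2s.stream().mapToInt(Integer::intValue).sum());
    }

    public static void main(String[] args) {
        List<String> input = List.of("A B C", "B C D");
        // The shuffle phase is simulated by grouping intermediate pairs by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long lineNo = 0;
        for (String line : input)
            map(lineNo++, line,
                (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            reduce(e.getKey(), e.getValue(),
                   (k, v) -> System.out.println(k + "\t" + v));
    }
}
```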

Many research works have extended existing algorithms, simple or complex, to MapReduce. [2, 21, 24] extend the MapReduce paradigm to deal with traditional join operations as in a DBMS: Okcan et al. [21] focused on processing theta-joins using MapReduce; Afrati et al. [2] proposed methods to optimize multi-way joins in a MapReduce environment; Yang et al. [24] modified the original MapReduce framework so that it supports join operations more efficiently. Other works deal with complex join operations using MapReduce, such as fuzzy joins [1], efficient similarity joins on massive high-dimensional datasets [20] and top-k similarity joins [16].

2.2 Set similarity join on certain data

Set similarity join on a single machine has been well studied in the literature [3, 6, 8]. Set similarity join on large-scale data is computation intensive, so several works have tried to resolve this problem using the MapReduce paradigm [5, 12, 22]. Elsayed et al. [12] implemented the Full-Filtering approach using MapReduce. The authors proposed an algorithm to compute the similarity of candidate document pairs using two MapReduce jobs: the first job builds the inverted index for all the documents and the second job computes the similarities of the document pairs. Document pairs that do not share any term are never evaluated, but pairs that share at least one term have to be evaluated, even though most such pairs are not similar at all. Vernica et al. [22] proposed a MapReduce algorithm based on the Prefix-Filtering principle [23]. For each term t in the signature Sig(di) of a document di, the Map function emits a key-value pair ⟨t, di⟩: the term t is used


as the key and the whole document di is used as the value. At the Reduce side, all the documents sharing a common term are grouped together, and the reducers find the similar pairs in each group using the techniques in [23]. The disadvantage of this approach is that two documents may be evaluated many times. For example, if the prefixes of di and dj share the terms {t1, t2, t3}, then di and dj are evaluated three times, which wastes time. In order to overcome the weak points of the above two algorithms, Baraglia et al. [5] proposed two algorithms based on the Prefix-Filtering technique: Double-Pass MapReduce Prefix-Filtering (SSJ-2) and Double-Pass MapReduce Prefix-Filtering with Remainder File (SSJ-2R); the experimental results show that the performance of SSJ-2 and SSJ-2R is better than that of the above two approaches.

2.3 Set similarity join on uncertain data

Many research works [4, 6, 8] on set similarity join focus on the join over certain sets, where each set is assumed to be precisely known. There are also existing works on joins over uncertain databases [9, 18], but the underlying uncertain database is assumed to contain numerical data, not set data, so those approaches cannot be used to solve the probabilistic set similarity join problem directly. Up to now, only a few works have tried to address this problem. Jestes et al. [15] focused on probabilistic string similarity join with the expected edit distance; the authors used probabilistic q-grams to improve the pruning effect. Lian et al. [19] were the first to study the set similarity join on probabilistic data problem; they proposed Jaccard distance pruning and probability upper bound pruning principles to filter the candidate pairs, and used an M-tree to index the probabilistic data based on these pruning methods. The experimental results show that the performance of [19] is much better than the baselines, but the authors did not consider the index creation cost, which is actually heavy.

3 Problem definition

In this section, we give the definition of probabilistic set similarity join; the definition mainly follows Lian et al. [19].

3.1 Set-level probabilistic set database

Table 1 shows an example of a set-level probabilistic set database, which contains two probabilistic sets r1 and r2. For example, r1 has three set instances r11 = {A,B,C,D}, r12 = {A,C,D,E} and r13 = {B,C,D,E}, whose existence probabilities are r11.p = 0.3, r12.p = 0.3 and r13.p = 0.2 respectively; the symbols A, B, C, D and E are the set elements.

From the above example we can see that a set-level probabilistic set database $R^P$ consists of a number of probabilistic sets, denoted as $r_i$. Each $r_i$ is represented by $n_i$ set instances $r_{i1}, r_{i2}, \ldots, r_{in_i}$. All the set instances $r_{ik}$ (for any $1 \le k \le n_i$) of a probabilistic set $r_i$ are mutually exclusive (i.e., they cannot appear at the same time in the real world); each instance $r_{ik}$ is associated with an existence probability $r_{ik}.p \in (0,1]$, where $\sum_{k=1}^{n_i} r_{ik}.p \le 1$.


Table 1  Set-level probabilistic set database

  Probabilistic set r_i   Set instance r_ik    Existence prob. r_ik.p
  r1                      r11 = {A,B,C,D}      0.3
                          r12 = {A,C,D,E}      0.3
                          r13 = {B,C,D,E}      0.2
  r2                      r21 = {A,B,E,F}      0.4
                          r22 = {B,C,D,E}      0.3
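As a hedged illustration of this model (class and method names are ours, not the paper's), the database of Table 1 could be represented as follows:

```java
import java.util.*;

// A sketch of the set-level probabilistic set model of Sect. 3.1: each
// probabilistic set holds mutually exclusive instances, each with an
// existence probability, and the probabilities sum to at most 1.
public class ProbabilisticSet {
    final String id;
    final List<Set<String>> instances = new ArrayList<>();
    final List<Double> probs = new ArrayList<>();

    ProbabilisticSet(String id) { this.id = id; }

    void addInstance(Set<String> tokens, double p) {
        instances.add(tokens);
        probs.add(p);
        // Model constraint: the sum of existence probabilities stays in (0, 1].
        if (probSum() > 1.0) throw new IllegalArgumentException("sum > 1");
    }

    double probSum() { return probs.stream().mapToDouble(Double::doubleValue).sum(); }

    public static void main(String[] args) {
        ProbabilisticSet r1 = new ProbabilisticSet("r1");   // set r1 from Table 1
        r1.addInstance(Set.of("A", "B", "C", "D"), 0.3);
        r1.addInstance(Set.of("A", "C", "D", "E"), 0.3);
        r1.addInstance(Set.of("B", "C", "D", "E"), 0.2);
        System.out.println(r1.id + " probability sum = " + r1.probSum()); // ~0.8
    }
}
```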

3.2 Probabilistic set similarity join definition

Definition 1 (Probabilistic Set Similarity Join, PS2J) Given two probabilistic set databases $R^P$ and $S^P$, a similarity threshold $\gamma \in (0,1]$ and a probabilistic threshold $\alpha \in (0,1]$, a probabilistic set similarity join finds all the pairs $(r_i, s_j)$ from $R^P$ and $S^P$ whose similarity probability is greater than or equal to the threshold $\alpha$, that is:

$$\mathrm{PS^2J} = \bigl\{\langle r_i, s_j\rangle \mid r_i \in R^P,\ s_j \in S^P,\ \Pr\{sim(r_i, s_j) \ge \gamma\} \ge \alpha\bigr\} \quad (1)$$

where $sim(\cdot,\cdot)$ is a similarity function between two sets.

The similarity function $sim(\cdot,\cdot)$ in Eq. (1) has many possible choices, such as Jaccard similarity, cosine similarity and overlap similarity; the choice depends on the application. These three measures are inter-related and can be converted into each other through some variation. For simplicity we only focus on one popular set similarity measure, Jaccard similarity:

$$sim(x, y) = J(x, y) = \frac{|x \cap y|}{|x \cup y|} \quad (2)$$

Lemma 1 (Probability Computation on the Set Level) The probability $\Pr\{sim(r_i, s_j) \ge \gamma\}$ in Eq. (1) can be computed on the set level as:

$$\Pr\{sim(r_i, s_j) \ge \gamma\} = \sum_{\forall r' \in r_i}\sum_{\forall s' \in s_j} r'.p \cdot s'.p \cdot \chi\bigl(sim(r', s') \ge \gamma\bigr) \quad (3)$$

where $r'$ and $s'$ are set instances of $r_i$ and $s_j$ respectively, and $\chi(sim(r', s') \ge \gamma)$ is a boolean function:

$$\chi\bigl(sim(r', s') \ge \gamma\bigr) = \begin{cases} 1 & \text{if } sim(r', s') \ge \gamma \text{ is true;}\\ 0 & \text{otherwise.} \end{cases} \quad (4)$$

The proof can be found in [19].
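The computation in Lemma 1 is easy to state in code. The following minimal sketch (our own, with illustrative names) enumerates instance pairs, applies Eq. (2) for the Jaccard similarity, and accumulates the probability mass of Eq. (3) for the sets r1 and r2 of Table 1:

```java
import java.util.*;

// Sketch of Lemma 1: Pr{sim(ri, sj) >= gamma} is the total probability mass
// of instance pairs whose Jaccard similarity reaches gamma (Eqs. (2)-(4)).
public class PS2JProbability {
    static double jaccard(Set<String> x, Set<String> y) {       // Eq. (2)
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        return (double) inter.size() / (x.size() + y.size() - inter.size());
    }

    // ri and sj: lists of instances, each paired with its existence probability.
    static double joinProb(List<Map.Entry<Set<String>, Double>> ri,
                           List<Map.Entry<Set<String>, Double>> sj, double gamma) {
        double pr = 0.0;
        for (var r : ri)
            for (var s : sj)
                if (jaccard(r.getKey(), s.getKey()) >= gamma)    // chi(.) = 1
                    pr += r.getValue() * s.getValue();           // Eq. (3)
        return pr;
    }

    public static void main(String[] args) {
        var r1 = List.of(Map.entry(Set.of("A", "B", "C", "D"), 0.3),
                         Map.entry(Set.of("A", "C", "D", "E"), 0.3),
                         Map.entry(Set.of("B", "C", "D", "E"), 0.2));
        var r2 = List.of(Map.entry(Set.of("A", "B", "E", "F"), 0.4),
                         Map.entry(Set.of("B", "C", "D", "E"), 0.3));
        // Only the instance pairs with J >= 0.6 contribute to the sum.
        System.out.println(joinProb(r1, r2, 0.6));
    }
}
```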


Fig. 2 Block nested loop join

4 Baseline methods

In this section, we introduce the procedure of the baseline method, Block Nested Loop Join, and give a detailed analysis of its communication and computation cost.

4.1 Block nested loop join

The baseline method for set similarity join on probabilistic data in MapReduce adopts the block nested loop join methodology. The basic procedure is as follows: divide the original data R into m equal-sized blocks in the Map phase, so that each block contains about |R|/m records; this can be finished in a linear scan of R. Then every possible candidate pair of blocks is grouped into a bucket at the end of the Map phase; each bucket is processed by a reducer, which performs a block nested loop join between the blocks in the same bucket. E.g., in Fig. 2, the original data R is partitioned into three blocks R00, R01 and R10, and the possible candidate pairs of blocks are ⟨R00,R00⟩, ⟨R01,R01⟩, ⟨R10,R10⟩, ⟨R00,R01⟩, ⟨R00,R10⟩ and ⟨R01,R10⟩; the pairs ⟨R00,R00⟩, ⟨R01,R01⟩ and ⟨R10,R10⟩ can be handled by a self-join, so just one block is enough for each of these pairs. Every record needs to be replicated three times, and each candidate pair is processed by one reducer.
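The block-to-bucket routing can be sketched as follows (a simplified stand-alone illustration under the description above; names and structure are ours, not the paper's implementation):

```java
import java.util.*;

// Sketch of the Map-phase routing in BNLJ: every record of block b is
// replicated to the bucket of every block pair (i, j), i <= j, that
// involves b; one reducer then joins the blocks of each bucket.
public class BnljPartitioner {
    public static void main(String[] args) {
        int m = 3;  // number of blocks, e.g. R00, R01, R10 in Fig. 2
        List<int[]> buckets = new ArrayList<>();
        for (int i = 0; i < m; i++)
            for (int j = i; j < m; j++)     // self-pairs handled by self-join
                buckets.add(new int[]{i, j});
        System.out.println("bucket count = " + buckets.size());  // m(m+1)/2 = 6

        // A record in block b goes to every bucket containing b:
        int b = 0;
        long copies = buckets.stream().filter(p -> p[0] == b || p[1] == b).count();
        System.out.println("replications of a block-0 record = " + copies);  // m = 3
    }
}
```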

4.2 Cost analysis

In this section we analyse the cost of block nested loop join in MapReduce; the whole cost is mainly composed of communication cost and CPU cost. Figure 3 shows the candidate pairs of the data blocks. Supposing that the original data set R is partitioned into m blocks, the numbers of possible candidate block pairs contributed per block are: $R_1: m$, $R_2: m-1$, ..., $R_m: 1$. The total number of block pairs is $\mathrm{TotalNum} = \sum_{i=1}^{m}(m-i+1) = m(m+1)/2$. Each pair contains two blocks, so the whole communication cost of BNLJ is $O(2 \cdot (m(m+1)/2) \cdot (|R|/m)) = O((m+1)|R|)$. In terms of CPU cost, the cost for each pair is $|R|^2/m^2$; summing over all the pairs, the cost is $O((|R|^2/m^2)(m(m+1)/2)) = O(|R|^2(m+1)/2m)$.
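As a worked instance of these formulas (our own illustrative numbers, not from the paper): with $m = 4$ blocks,

$$\mathrm{TotalNum} = \frac{m(m+1)}{2} = \frac{4 \cdot 5}{2} = 10, \qquad \text{communication cost} = 2 \cdot 10 \cdot \frac{|R|}{4} = 5|R| = (m+1)|R|,$$

$$\text{CPU cost} = \frac{|R|^2}{4^2} \cdot 10 = \frac{5|R|^2}{8} = \frac{(m+1)|R|^2}{2m}.$$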


Fig. 3 Cost of block nested loop join

5 Probability based prefix-filtering

5.1 Probability computation analysis

In this subsection we make a comprehensive analysis of the probability computation, and give two theorems that can be used to prune candidate pairs in the following stages.

Theorem 1 If $\sum_{\forall r' \in r_i} r'.p < \alpha$, then $r_i$ can be safely pruned; that is, it is impossible for $r_i$ to be similar to any other probabilistic set.

Proof According to Lemma 1, for any probabilistic set $s_j$ we can obtain:

$$\Pr\{sim(r_i, s_j) \ge \gamma\} = \sum_{\forall r' \in r_i}\sum_{\forall s' \in s_j} r'.p \cdot s'.p \cdot \chi\bigl(sim(r', s') \ge \gamma\bigr) \le \sum_{\forall r' \in r_i}\sum_{\forall s' \in s_j} r'.p \cdot s'.p = \sum_{\forall r' \in r_i} r'.p \cdot \sum_{\forall s' \in s_j} s'.p.$$

Based on the definition of the probabilistic set data model, $\sum_{\forall s' \in s_j} s'.p \le 1$, so:

$$\Pr\{sim(r_i, s_j) \ge \gamma\} \le \sum_{\forall r' \in r_i} r'.p \cdot \sum_{\forall s' \in s_j} s'.p \le \sum_{\forall r' \in r_i} r'.p < \alpha.$$

According to the definition of probabilistic set similarity join, $r_i$ can be safely pruned beforehand. □


Fig. 4 Visualization of the probability computation

Theorem 2 If $\sum_{\forall r' \in r_i} r'.p \cdot \sum_{\forall s' \in s_j} s'.p < \alpha$, then the candidate pair $(r_i, s_j)$ can be pruned.

Proof According to the proof of Theorem 1, we can get:

$$\Pr\{sim(r_i, s_j) \ge \gamma\} \le \sum_{\forall r' \in r_i} r'.p \cdot \sum_{\forall s' \in s_j} s'.p < \alpha.$$

According to the definition of probabilistic set similarity join, the candidate pair $(r_i, s_j)$ can be pruned. □

Visualization of the probability computation: if we write $p_i = \sum_{\forall r' \in r_i} r'.p$ and $p_j = \sum_{\forall s' \in s_j} s'.p$, then:

$$\Pr\{sim(r_i, s_j) \ge \gamma\} \le \sum_{\forall r' \in r_i} r'.p \cdot \sum_{\forall s' \in s_j} s'.p = p_i \cdot p_j.$$

Let $p_i$ be the x-axis and $p_j$ be the y-axis; the curve $p_i \cdot p_j = \alpha$ is depicted in Fig. 4. If $p_i < \alpha$, $r_i$ can be pruned safely beforehand, because the product of $p_i$ and any other $p_j$ cannot be greater than $\alpha$ (since $p_j \le 1$), so it is unnecessary to compare $r_i$ with any other probabilistic set; $s_j$ is the same case. If $p_i \cdot p_j < \alpha$ (corresponding to region C, where both $p_i$ and $p_j$ are bigger than $\alpha$), the pair $(r_i, s_j)$ is filtered. If $p_i \cdot p_j \ge \alpha$ (corresponding to region D), the pair $(r_i, s_j)$ is a candidate pair and needs to be verified in the following step.
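Both theorems reduce to cheap comparisons of probability sums against α, as the following sketch shows (helper names are ours; the values follow Table 1, plus a hypothetical single-instance set r4 like the one in the example of Sect. 5.4.1 below):

```java
import java.util.List;

// Sketch of Theorems 1 and 2: p_i and p_j are the existence-probability
// sums of r_i and s_j, so both pruning rules are simple product checks
// against the probabilistic threshold alpha.
public class ProbabilityPruning {
    static double probSum(List<Double> instanceProbs) {
        return instanceProbs.stream().mapToDouble(Double::doubleValue).sum();
    }

    // Theorem 1: if the probability sum of ri is below alpha, prune ri itself.
    static boolean pruneSet(List<Double> ri, double alpha) {
        return probSum(ri) < alpha;
    }

    // Theorem 2: prune the pair (ri, sj) if p_i * p_j < alpha.
    static boolean prunePair(List<Double> ri, List<Double> sj, double alpha) {
        return probSum(ri) * probSum(sj) < alpha;
    }

    public static void main(String[] args) {
        double alpha = 0.6;
        List<Double> r1 = List.of(0.3, 0.3, 0.2);     // p_1 = 0.8 (Table 1)
        List<Double> r2 = List.of(0.4, 0.3);          // p_2 = 0.7 (Table 1)
        List<Double> r4 = List.of(0.4);               // single instance, p_4 = 0.4
        System.out.println(pruneSet(r4, alpha));      // true: 0.4 < 0.6
        System.out.println(prunePair(r1, r2, alpha)); // true: 0.56 < 0.6
    }
}
```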

5.2 Probabilistic token frequency ordering

The token order influences the filtering effect when we create the inverted index using the prefixes of the sets, so we have to find a suitable token order. It is well known that the token frequency distribution usually follows a Zipf-like distribution, so in order to tackle the data skew problem and to make sure that fewer candidate pairs are generated, we sort the tokens in increasing token-frequency order. In addition, each instance set has an existence probability; in order to reflect the real frequency order of the tokens, we take the existence


Fig. 5 Probabilistic token frequency ordering using MapReduce

probability into consideration when counting the token frequency. If we write the probabilistic frequency of token $t_k$ as $\mathrm{ProbTF}(t_k)$, then $\mathrm{ProbTF}(t_k) = \sum_{i=1}^{m}\sum_{\forall r_{ij} \in r_i} r_{ij}.p \cdot \chi(t_k \in r_{ij})$, where $\chi(\cdot)$ equals 1 if $t_k \in r_{ij}$ is true, and 0 otherwise. E.g., in Fig. 5, ProbTF(A) = r11.p + r12.p + r21.p + r33.p = 0.3 + 0.3 + 0.4 + 0.1 = 1.1. After we get the probabilistic frequency of all the tokens, we sort the tokens in increasing frequency order; the final result is {HKGFABCED}. Because the list of distinct tokens is much smaller than the original data, we can sort the tokens directly in memory.
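A minimal sketch of this counting (our own data layout; the instances are illustrative, not the full example of Fig. 5):

```java
import java.util.*;

// Sketch of probabilistic token frequency ordering (Sect. 5.2): ProbTF(t)
// accumulates the existence probability of every instance containing t;
// tokens are then sorted by increasing ProbTF to form the global order.
public class ProbTokenFrequency {
    public static void main(String[] args) {
        // Each instance: (token set, existence probability).
        List<Map.Entry<Set<String>, Double>> instances = List.of(
            Map.entry(Set.of("A", "B", "C", "D"), 0.3),
            Map.entry(Set.of("A", "C", "D", "E"), 0.3),
            Map.entry(Set.of("A", "B", "E", "F"), 0.4));

        Map<String, Double> probTF = new HashMap<>();
        for (var inst : instances)
            for (String t : inst.getKey())              // chi(t in r_ij) = 1
                probTF.merge(t, inst.getValue(), Double::sum);

        // Global token order: increasing probabilistic frequency, rarest first.
        List<String> order = new ArrayList<>(probTF.keySet());
        order.sort(Comparator.comparingDouble(probTF::get));
        System.out.println(order);
    }
}
```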

5.3 Combined prefix-filtering

In order to reduce the candidate pairs, Xiao et al. [23] proposed the prefix filtering principle for certain data. As a probabilistic set has more than one set instance, we cannot directly use the prefix filtering principle in probabilistic set similarity join. So we propose a new approach called the Combined Prefix Filtering Principle to build the inverted index of the data sets.

Lemma 2 (Combined Prefix Filtering Principle) Consider two probabilistic sets $r_i$ and $r_j$, with $r_i = \{\langle r_{i1}, p_{r_{i1}}\rangle, \langle r_{i2}, p_{r_{i2}}\rangle, \ldots, \langle r_{im}, p_{r_{im}}\rangle\}$ and $r_j = \{\langle r_{j1}, p_{r_{j1}}\rangle, \langle r_{j2}, p_{r_{j2}}\rangle, \ldots, \langle r_{jn}, p_{r_{jn}}\rangle\}$. According to the prefix filtering principle in [23], we can get the prefix of each probabilistic set instance: $\mathrm{Pre}(r_{i1}), \mathrm{Pre}(r_{i2}), \ldots, \mathrm{Pre}(r_{im})$ respectively. The combined prefix of $r_i$ is $\mathrm{ComPre}(r_i) = \mathrm{Pre}(r_{i1}) \cup \mathrm{Pre}(r_{i2}) \cup \cdots \cup \mathrm{Pre}(r_{im})$; in the same way, $\mathrm{ComPre}(r_j) = \mathrm{Pre}(r_{j1}) \cup \mathrm{Pre}(r_{j2}) \cup \cdots \cup \mathrm{Pre}(r_{jn})$. If $\Pr\{sim(r_i, r_j) \ge \gamma\} \ge \alpha$, then $\mathrm{ComPre}(r_i)$ and $\mathrm{ComPre}(r_j)$ must share at least one token.


Proof Suppose, on the contrary, that the combined prefixes are disjoint:

$$\mathrm{ComPre}(r_i) \cap \mathrm{ComPre}(r_j) = \emptyset$$
$$\Rightarrow \bigl\{\mathrm{Pre}(r_{i1}) \cup \cdots \cup \mathrm{Pre}(r_{im})\bigr\} \cap \bigl\{\mathrm{Pre}(r_{j1}) \cup \cdots \cup \mathrm{Pre}(r_{jn})\bigr\} = \emptyset$$
$$\Rightarrow \bigcup_{u=1..m,\ v=1..n}\bigl(\mathrm{Pre}(r_{iu}) \cap \mathrm{Pre}(r_{jv})\bigr) = \emptyset$$
$$\Rightarrow \forall u, v:\ \mathrm{Pre}(r_{iu}) \cap \mathrm{Pre}(r_{jv}) = \emptyset$$

According to the prefix filtering principle, we can get:

$$\Rightarrow \forall u, v:\ sim(r_{iu}, r_{jv}) < \gamma \Rightarrow \forall u, v:\ \chi\bigl(sim(r_{iu}, r_{jv}) \ge \gamma\bigr) = 0$$
$$\Rightarrow \Pr\{sim(r_i, r_j) \ge \gamma\} = \sum_{\forall r_{iu} \in r_i}\sum_{\forall r_{jv} \in r_j} r_{iu}.p \cdot r_{jv}.p \cdot \chi\bigl(sim(r_{iu}, r_{jv}) \ge \gamma\bigr) = 0 < \alpha,$$

which contradicts $\Pr\{sim(r_i, r_j) \ge \gamma\} \ge \alpha$. So the two combined prefixes must share at least one token. □
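The following sketch builds a combined prefix and applies Lemma 2, assuming (as is standard for Jaccard prefix filtering [23]) a prefix length of ⌊(1−γ)·|r|⌋ + 1 per instance and tokens already sorted in the global order of Sect. 5.2; names and data are ours:

```java
import java.util.*;

// Sketch of the combined prefix of Lemma 2: ComPre(ri) is the union of the
// per-instance prefixes; a disjoint intersection lets us prune the pair.
public class CombinedPrefix {
    static List<String> prefix(List<String> sortedTokens, double gamma) {
        // Assumed Jaccard prefix length: floor((1 - gamma) * |r|) + 1.
        int len = (int) Math.floor((1 - gamma) * sortedTokens.size()) + 1;
        return sortedTokens.subList(0, Math.min(len, sortedTokens.size()));
    }

    // ComPre(ri) = union of the prefixes of all instances of ri.
    static Set<String> combinedPrefix(List<List<String>> instances, double gamma) {
        Set<String> comPre = new LinkedHashSet<>();
        for (List<String> inst : instances) comPre.addAll(prefix(inst, gamma));
        return comPre;
    }

    public static void main(String[] args) {
        double gamma = 0.8;
        var ri = List.of(List.of("A", "B", "C", "D"), List.of("A", "C", "D", "E"));
        var rj = List.of(List.of("F", "B", "C", "E"));
        Set<String> pi = combinedPrefix(ri, gamma), pj = combinedPrefix(rj, gamma);
        pi.retainAll(pj);   // Lemma 2: empty intersection => the pair is pruned
        System.out.println(pi.isEmpty() ? "prune" : "candidate");
    }
}
```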

5.4 Map side pruning

In this subsection, we propose a pruning method at the Map task side that, according to Theorem 1, reduces the number of sets that need to be compared, and a duplicate-free comparing approach that avoids duplicated comparisons.

5.4.1 Probability based pre-pruning

According to Theorem 1, if $\sum_{\forall r' \in r_i} r'.p < \alpha$, then $r_i$ can be safely pruned. In the map function, we compute the sum of the existence probabilities of the instances of each probabilistic set $r_i$; if the sum is smaller than the threshold $\alpha$, then $r_i$ need not be emitted. For example, in Fig. 6, $r_4$ contains just one instance $r_{41}$, whose existence probability is 0.4. If we write the upper bound of the probability product of $r_i$ and $r_j$ as $UP\_B(r_i, r_j)$ and the probabilistic threshold is $\alpha = 0.6$, then $UP\_B(r_4, r_1) = 0.4 \times (0.3+0.3+0.2) = 0.32 < 0.6$; $UP\_B(r_4, r_2) = 0.4 \times (0.4+0.3) = 0.28 < 0.6$; $UP\_B(r_4, r_3) = 0.4 \times (0.2+0.5+0.1) = 0.32 < 0.6$, so $r_4$ can be pre-pruned safely.

5.4.2 Duplicate free comparing

In [22], the Map task replicates each probabilistic set l times, where l is the length of its prefix. Probabilistic set pairs that share multiple tokens in their prefixes are therefore compared many times at different reducers. These duplicate comparisons produce additional computational cost and also require post-processing to remove the duplicates. In order to eliminate these duplicate comparisons, we adopt


Fig. 6 Improved prefix filtering

Table 2  ⟨key, value⟩ pairs

  key   value
  A     ⟨r1, , 0.8; ...⟩
  B     ⟨r1, A, 0.8; ...⟩
  C     ⟨r1, AB, 0.8; ...⟩
  A     ⟨r2, , 0.7; ...⟩
  F     ⟨r2, A, 0.7; ...⟩
  B     ⟨r2, AF, 0.7; ...⟩
  C     ⟨r2, AFB, 0.7; ...⟩

Table 3  Prefix of the prefix

  (token, ri)      Pre. of pre.
  PrePre(A, r1)    ∅
  PrePre(B, r1)    {A}
  PrePre(C, r1)    {A B}
  PrePre(A, r2)    ∅
  PrePre(F, r2)    {A}
  PrePre(B, r2)    {A F}
  PrePre(C, r2)    {A F B}

some tricks. E.g., in Fig. 6, the combined prefixes of r1 and r2 are ComPre(r1) = {A B C} and ComPre(r2) = {A F B C}, and ComPre(r1) ∩ ComPre(r2) = {A B C}, so r1 and r2 will each be replicated three times, keyed by A, B and C. For each ⟨key, value⟩ pair, we add some additional information to the value; the ⟨key, value⟩ pairs for r1 and r2 are shown in Table 2. The token lists in Table 3 are the prefix of the prefix, written as PrePre(token, ri). At the reduce side, for a given token tk, we can decide whether the candidate pair (ri, rj) should be filtered according to the intersection of PrePre(tk, ri) and PrePre(tk, rj). If PrePre(tk, ri) ∩ PrePre(tk, rj) ≠ ∅, then ri and rj must already have been compared under another token. In Table 3, we can observe that PrePre(B, r1) ∩ PrePre(B, r2) = {A} and PrePre(C, r1) ∩ PrePre(C, r2) = {A B}, so


the candidate pair ⟨r1, r2⟩ corresponding to tokens B and C need not be compared again (the red pairs in Fig. 6).
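A compact sketch of this check (our own names; the data reproduces ComPre(r1) and ComPre(r2) from Fig. 6):

```java
import java.util.*;

// Sketch of the "prefix of the prefix" trick (Sect. 5.4.2): for token t at
// position k of ComPre(ri), PrePre(t, ri) is the set of tokens before k; a
// reducer for t skips the pair if the two PrePre sets intersect, because an
// earlier token's reducer already compared the pair.
public class DuplicateFreeCompare {
    static Set<String> prePre(List<String> comPre, String token) {
        return new HashSet<>(comPre.subList(0, comPre.indexOf(token)));
    }

    static boolean alreadyCompared(List<String> comPreI, List<String> comPreJ,
                                   String token) {
        Set<String> pi = prePre(comPreI, token);
        pi.retainAll(prePre(comPreJ, token));
        return !pi.isEmpty();   // shared earlier token => duplicate comparison
    }

    public static void main(String[] args) {
        List<String> r1 = List.of("A", "B", "C");        // ComPre(r1), Fig. 6
        List<String> r2 = List.of("A", "F", "B", "C");   // ComPre(r2), Fig. 6
        for (String t : List.of("A", "B", "C"))          // shared prefix tokens
            System.out.println(t + ": " +
                (alreadyCompared(r1, r2, t) ? "skip" : "compare"));
        // Output: A: compare, B: skip, C: skip  (matches Table 3)
    }
}
```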

5.5 Reduce side pruning

In order to reduce the comparison cost at the Reduce task side, we propose two pruning methods: probability sum based pruning and probability upper bound based pruning.

5.5.1 Probability sum based pruning

According to Theorem 2, if $\sum_{\forall r' \in r_i} r'.p \cdot \sum_{\forall s' \in s_j} s'.p < \alpha$, then the candidate pair $(r_i, s_j)$ can be pruned. In order to utilize this feature, in the Map task we pre-compute the sum of the existence probabilities of the set instances of each probabilistic set and make the sum one part of the value. At the Reduce side, we use this information to decide whether the pair $(r_i, s_j)$ needs to be further verified. In Fig. 6, 0.8, 0.7 and 0.8 are the probability sums of $r_1$, $r_2$ and $r_3$ respectively. Supposing that the probabilistic threshold $\alpha = 0.6$, and writing the probability sum of $r_i$ for token $t_k$ as $\mathrm{Prob}(t_k, r_i)$, we get: Prob(A, r1) = 0.8, Prob(A, r2) = 0.7, Prob(A, r3) = 0.8; Prob(F, r2) = 0.7, Prob(F, r3) = 0.8. Because Prob(A, r1) × Prob(A, r2) = 0.56 < α, the pair (r1, r2) can be filtered without computing the actual probability. It is the same case with (r2, r3).

5.5.2 Probability upper bound based pruning

In this section, we derive a probability upper bound from Eq. (2):

$$sim(x, y) = J(x, y) = \frac{|x \cap y|}{|x \cup y|} = \frac{|x \cap y|}{|x| + |y| - |x \cap y|}$$

So we can replace the condition in the function $\chi(\cdot)$ of Eq. (3) as follows; since $|r' \cap s'| \le \min\{|r'|, |s'|\}$ and $|r'| + |s'| - |r' \cap s'| \ge \max\{|r'|, |s'|\}$:

$$\begin{aligned}
\Pr\{sim(r_i, s_j) \ge \gamma\} &= \sum_{\forall r' \in r_i}\sum_{\forall s' \in s_j} r'.p \cdot s'.p \cdot \chi\bigl(sim(r', s') \ge \gamma\bigr)\\
&= \sum_{\forall r' \in r_i}\sum_{\forall s' \in s_j} r'.p \cdot s'.p \cdot \chi\!\left(\frac{|r' \cap s'|}{|r'| + |s'| - |r' \cap s'|} \ge \gamma\right)\\
&\le \sum_{\forall r' \in r_i}\sum_{\forall s' \in s_j} r'.p \cdot s'.p \cdot \chi\!\left(\frac{\min\{|r'|, |s'|\}}{\max\{|r'|, |s'|\}} \ge \gamma\right)\\
&= UB\_P(r_i, s_j)
\end{aligned}$$


We can use $UB\_P(r_i, s_j)$ as the probability upper bound of the pair $(r_i, s_j)$. As long as we pre-compute the length of each probabilistic set instance, we can obtain the upper bound value. If $UB\_P(r_i, s_j) < \alpha$, the pair $(r_i, s_j)$ can be filtered safely.

In order to compute the probability upper bound, we pre-compute the length of each set instance at the Map side and add the length information to the value. At the Reduce side, we use the length information to figure out the upper bound; if $UB\_P(r_i, s_j) < \alpha$, then the pair $(r_i, s_j)$ can be filtered safely.
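A sketch of the bound (our own names; the instance lengths and probabilities are illustrative): only lengths and probabilities are consulted, so the check is cheap:

```java
import java.util.*;

// Sketch of the upper bound UB_P(ri, sj) of Sect. 5.5.2: an instance pair
// can only contribute probability mass if min(len)/max(len) >= gamma,
// so no token intersection is needed to evaluate the bound.
public class UpperBoundPruning {
    // Instance length and existence probability for one set instance.
    record Inst(int len, double p) {}

    static double ubP(List<Inst> ri, List<Inst> sj, double gamma) {
        double ub = 0.0;
        for (Inst r : ri)
            for (Inst s : sj) {
                double best = (double) Math.min(r.len(), s.len())
                            / Math.max(r.len(), s.len());
                if (best >= gamma) ub += r.p() * s.p();  // chi(min/max >= gamma)
            }
        return ub;
    }

    public static void main(String[] args) {
        double gamma = 0.8, alpha = 0.6;
        var ri = List.of(new Inst(4, 0.3), new Inst(10, 0.5));
        var sj = List.of(new Inst(4, 0.4), new Inst(5, 0.3));
        double ub = ubP(ri, sj, gamma);   // only length-compatible pairs count
        System.out.println(ub < alpha ? "prune pair" : "verify pair");
    }
}
```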

5.6 Hybrid solution: both map side and reduce side pruning

In order to further improve the performance of the probabilistic set similarity join, we propose a hybrid solution that uses both the Map side pruning and the Reduce side pruning methods together. At the Map side, we use the probability based pre-pruning to eliminate the probabilistic sets that have no chance of being similar to any other set; then at the Reduce side, we further use the probability sum based pruning and the probability upper bound based pruning to filter the pairs.

6 Experimental evaluations

In this section, we present comprehensive experiments that test the performance of our proposed approaches: Hadoop Join by Map Side Pruning (HJ-MSP), Hadoop Join by Reduce Side Pruning (HJ-RSP) and Hadoop Join by Hybrid Pruning (HJ-HybridP). The evaluation measures include run time and speedup ratio: run time refers to the total time cost of the similarity join procedure, and speedup ratio is defined as the run time of Block Nested Loop Join (BNLJ) divided by that of HJ-MSP, HJ-RSP and HJ-HybridP. In addition, we also test the system scaleup and system speedup with different numbers of computer nodes.

Experimental Setup Our experiments are implemented on Hadoop-0.20.2. The cluster consists of 16 nodes connected by a 1 Gbit Ethernet switch; the configuration of each node is as follows: CPU: Q9650 3.00 GHz, memory: 4 GB, disk: 500 GB, OS: 64-bit Ubuntu 9.10 server. The main parameters and values used in the experiments are described in Table 4; the default values are presented in bold.

Table 4  Settings used in the experiments

  Parameter                     Values
  Probabilistic threshold: α    0.1, 0.3, 0.5, 0.7, 0.9
  Similarity threshold: γ       0.1, 0.3, 0.5, 0.7, 0.9
  Data size (K): N              100, 200, 300, 400, 500
  Number of computer nodes      2, 4, 8, 16


Fig. 7 Performance vs. α

Datasets We downloaded a DBLP dataset from the internet,1 which is used in [17]. The dataset has about 1,300,000 publication titles. We generated the required probabilistic datasets from the real DBLP dataset: each probabilistic set has 1 to 5 probabilistic set instances, the number of set instances of a given probabilistic set follows a Zipf distribution with skew factor 0.5, each set instance has an existence probability, and the sum of the existence probabilities of the set instances of a given probabilistic set is less than or equal to 1. In total we generated five datasets: 100 K, 200 K, 300 K, 400 K and 500 K.

6.1 Performance vs. probabilistic threshold

Figure 7 shows the performance vs. the probabilistic threshold α. As α increases, the run time of HJ-MSP, HJ-RSP and HJ-HybridP decreases, because in the similarity join procedure we use the probabilistic threshold α to filter some records: if the sum of the existence probabilities of the set instances is less than α, the corresponding probabilistic set can be filtered safely. The bigger α is, the more probabilistic sets are filtered, so the run time decreases as α increases. We can also see that the performance of HJ-MSP is better than that of HJ-RSP when α is over 0.7, because more probabilistic sets can be filtered in advance when the probabilistic threshold α is bigger. Compared with Block Nested Loop Join (BNLJ), the performance of HJ-MSP, HJ-RSP and HJ-HybridP is much better, and the speedup ratios reach up to 3350, 2850 and 3711 respectively given α = 0.9. The performance of HJ-HybridP is the best.

6.2 Performance vs. similarity threshold

In this subsection, we test the performance while varying the similarity threshold γ. From Fig. 8a we find that the run time of all the methods decreases as the similarity threshold increases, because as γ increases, the prefix length of each set becomes shorter,

1 http://www.cs.brown.edu/~hkimura/upi_dataset.html.


Fig. 8 Performance vs. γ

Fig. 9 Performance vs. data size N

so fewer candidate probabilistic sets are sent to the reduce side. When the similarity threshold is over 0.3, the performance of HJ-RSP is a little better than that of HJ-MSP, while when γ reaches 0.9, HJ-MSP and HJ-RSP have roughly the same performance. Compared with Block Nested Loop Join (BNLJ), the speedup ratios of HJ-MSP, HJ-RSP and HJ-HybridP reach up to 11617, 11703 and 11703 respectively given γ = 0.9. The performance of HJ-HybridP is again the best for all similarity thresholds.

6.3 Performance vs. data size

Figure 9 shows the performance for data sizes varying from 100 K to 500 K. The run time of all three methods increases nearly linearly as the data size increases, and the run time of HJ-MSP increases a little faster than that of HJ-RSP. Compared to BNLJ, the speedup ratio increases as the data size grows, because the run time of BNLJ increases much faster than that of HJ-MSP, HJ-RSP and HJ-HybridP.


Fig. 10 Scaleup

Fig. 11 Speedup

6.4 Performance scalability

In this section, we test the scalability of our proposed solutions by varying the number of computer nodes together with the data size: 2 ∗ 50 K, 4 ∗ 100 K, 8 ∗ 200 K and 16 ∗ 400 K. From Fig. 10 we find that the run time of HJ-MSP, HJ-RSP and HJ-HybridP grows slowly and nearly linearly, so we can conclude that they have good scalability.

6.5 Performance speedup

In this section, we test the speedup of our proposed solutions; in this experiment, we fixed the data size at 100 K and varied the number of computer nodes: 2, 4, 8 and 16. Here the definition of the speedup ratio is a little different from that of Sects. 6.1, 6.2 and 6.3: it is calculated as the run time on 2 nodes divided by that on 4 nodes, and so on. From Fig. 11 we find that the speedup ratio of HJ-MSP, HJ-RSP and HJ-HybridP grows linearly as the number of computer nodes increases for fixed-size data. This demonstrates that for fixed-size data HJ-MSP, HJ-RSP and HJ-HybridP have good scalability.

6.6 Summary analysis

According to the above experimental results, HJ-MSP has better performance when the probabilistic threshold is over 0.7 for a given similarity threshold γ, so HJ-MSP is preferable to HJ-RSP for bigger probabilistic thresholds.


For a given probabilistic threshold α, the performance of the three approaches is not very sensitive to the similarity threshold γ; HJ-RSP has slightly better performance when the similarity threshold is over 0.3. From the above experimental results we can conclude that the hybrid solution has the best performance in most cases.

6.7 Communication cost analysis

In this subsection, we conduct a general analysis of the communication cost. We use the replication rate to represent the communication cost; the replication rate refers to the average number of times each set is replicated to the reducers. For a probabilistic set $r_i$ containing $m$ instances $r_{i1}, \ldots, r_{im}$, supposing the similarity threshold is $\gamma$ and $|r_{ij}|$ represents the length of instance $r_{ij}$, the prefix length of instance $r_{ij}$ is $\lfloor(1-\gamma)|r_{ij}|\rfloor + 1$. According to the definition of the combined prefix in Sect. 5.3, the maximum length of $\mathrm{ComPre}(r_i)$ is $\sum_{j=1}^{m}(\lfloor(1-\gamma)|r_{ij}|\rfloor + 1)$, so each set $r_i$ will be replicated up to $\sum_{j=1}^{m}(\lfloor(1-\gamma)|r_{ij}|\rfloor + 1)$ times. A long set will be replicated many times and the communication cost will be very high; in order to reduce the communication cost, we proposed the above methods to filter sets in advance.
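As a worked instance (our own illustrative numbers): for $\gamma = 0.8$ and a probabilistic set with $m = 3$ instances of lengths 10, 12 and 8,

$$\sum_{j=1}^{3}\bigl(\lfloor(1-\gamma)|r_{ij}|\rfloor + 1\bigr) = (\lfloor 2\rfloor + 1) + (\lfloor 2.4\rfloor + 1) + (\lfloor 1.6\rfloor + 1) = 3 + 3 + 2 = 8,$$

so this set is shipped to at most 8 reducers; lowering $\gamma$ lengthens every prefix and raises the replication rate, which is why the Map-side filters above matter.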

7 Conclusions

In this paper, we studied set similarity join on massive probabilistic data using the MapReduce paradigm. We proposed the Hadoop Join by Map Side Pruning method to filter the probabilistic sets that have no chance of being similar to any other set, and the Hadoop Join by Reduce Side Pruning method to reduce the candidate pairs; we also proposed a hybrid solution that uses both Map side pruning and Reduce side pruning. The experimental results show that our proposed approaches perform much better than the Block Nested Loop Join method and have good scalability. In the future, we plan to study further problems on large-scale probabilistic data, such as k-Nearest Neighbor queries, clustering, etc.

Acknowledgements This research was partially supported by the grants from the Natural Science Foundation of China (No. 61070055, 91024032, 91124001); the National 863 High-tech Program (No. 2012AA011001, 2013AA013204); the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University (No. 11XNL010).

References

1. Afrati, F.N., Sarma, A.D., Menestrina, D., Parameswaran, A.G., Ullman, J.D.: Fuzzy joins using MapReduce. In: ICDE'12, pp. 498–509 (2012)
2. Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: EDBT'10, pp. 99–110 (2010)
3. Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB'06, pp. 918–929 (2006)
4. Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB'06, pp. 918–929 (2006)
5. Baraglia, R., Morales, G.D.F., Lucchese, C.: Document similarity self-join with MapReduce. In: ICDM'10, pp. 731–736 (2010)
6. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW'07, pp. 131–140 (2007)
7. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Comput. Netw. (1997). doi:10.1016/S0169-7552(97)00031-7
8. Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: ICDE'06 (2006)
9. Cheng, R., Singh, S., Prabhakar, S., Shah, R., Vitter, J.S., Xia, Y.: Efficient join processing over uncertain data. In: CIKM'06, pp. 738–747 (2006)
10. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI'04, pp. 137–150 (2004)
11. Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. VLDB J. (2009). doi:10.1007/s00778-008-0119-9
12. Elsayed, T., Lin, J.J., Oard, D.W.: Pairwise document similarity in large collections with MapReduce. In: ACL (Short Papers)'08, pp. 265–268 (2008)
13. Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. In: SOSP'03, pp. 29–43 (2003)
14. Henzinger, M.R.: Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: SIGIR'06, pp. 284–291 (2006)
15. Jestes, J., Li, F., Yan, Z., Yi, K.: Probabilistic string similarity joins. In: SIGMOD Conference'10, pp. 327–338 (2010)
16. Kim, Y., Shim, K.: Parallel top-k similarity join algorithms using MapReduce. In: ICDE'12, pp. 510–521 (2012)
17. Kimura, H., Madden, S., Zdonik, S.B.: UPI: a primary index for uncertain databases. In: PVLDB, pp. 630–637 (2010)
18. Kriegel, H.P., Kunath, P., Pfeifle, M., Renz, M.: Probabilistic similarity join on uncertain data. In: DASFAA'06, pp. 295–309 (2006)
19. Lian, X., Chen, L.: Set similarity join on probabilistic data. In: PVLDB, pp. 650–659 (2010)
20. Luo, W., Tan, H., Mao, H., Ni, L.: Efficient similarity joins on massive high-dimensional datasets using MapReduce. In: MDM'12, p. TBA (2012)
21. Okcan, A., Riedewald, M.: Processing theta-joins using MapReduce. In: SIGMOD Conference'11, pp. 949–960 (2011)
22. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: SIGMOD Conference'10, pp. 495–506 (2010)
23. Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. (2011). doi:10.1145/2000824.2000825
24. Yang, H.-c., Dasdan, A., Hsiao, R.-L., Stott Parker, D.: Map-reduce-merge: simplified relational data processing on large clusters. In: SIGMOD Conference'07, pp. 1029–1040 (2007)