Weighted Exact Set Similarity Join
The Pennsylvania State University
Dongwon Lee
Set Similarity Join
Def. Set Similarity Join (SSJoin): Between collections A and B, find X pairs of objects whose similarity > t:
  If X = “MOST”: Approximate SSJoin
  If X = “ALL”: Exact SSJoin
Wisconsin DB Seminar, 2009 2
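[Figure: objects in collections A and B connected by edges labeled with pairwise similarities (0.7, 0.5, 0.4, 0.9, 0.1, 0.2); example objects are the sets {Lake, Monona, Wisc, Dane, County} and {University, Mendota, Wisc, Dane}]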
Set Similarity Join
Weighted vs. Unweighted
  Weighting quantifies the relative importance of a token
    E.g., “Microsoft” is more important than “Corp.”
  How to assign meaningful weights to tokens is an important problem in itself
    Not further discussed here
Set Similarity Join
Approximate SSJoin
  Allows some false positives/negatives
  E.g., LSH as a solution
Exact SSJoin
  Does not allow any false positives/negatives
  Needs to be scalable
Weighted + Exact SSJoin
  Will simply call it “WESSJoin”
        | unweighted | weighted
exact   | UESSJoin   | WESSJoin
approx. | UASSJoin   | WASSJoin
Applications of WESSJoin
Entity resolution
Web document genre classification
  Find all pairs of documents with similar contents
Query refinement for web search
  For a query, find another with similar search results
Movie recommendation
  Identify users who have similar movie tastes w.r.t. the rented movies
Focus on string data represented as a SET
  E.g., document, web page, record
Research Issues
Why not express WESSJoin in SQL?
  Join predicate as UDF
  Cartesian product followed by UDF processing
  Inefficient evaluation
Special handling for WESSJoin needed
  Scalability
  Support diverse similarity (or distance) functions
    o E.g., Overlap, Jaccard, Cosine vs. Edit, …
  Support diverse computation models
    o E.g., Threshold vs. Top-k
Similarity/Distance Functions
Jaccard coefficient: $J(x,y) = \frac{|x \cap y|}{|x \cup y|}$
Overlap similarity: $O(x,y) = |x \cap y|$
Cosine similarity: $C(x,y) = \frac{\sum_i x_i y_i}{\sqrt{|x| \cdot |y|}}$
Hamming distance: $H(x,y) = |(x - y) \cup (y - x)|$
Levenshtein distance L(x,y): min # of edit operations to transform x to y
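The set-based functions above can be sketched in a few lines of Python. This is a minimal sketch that assumes unweighted (binary) token sets, so the cosine numerator reduces to the overlap; the function names are illustrative and not taken from the slides.

```python
import math


def overlap(x: set, y: set) -> int:
    """O(x,y) = |x ∩ y|."""
    return len(x & y)


def jaccard(x: set, y: set) -> float:
    """J(x,y) = |x ∩ y| / |x ∪ y| (0.0 for two empty sets)."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0


def cosine(x: set, y: set) -> float:
    """C(x,y) for binary token vectors: |x ∩ y| / sqrt(|x| * |y|)."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


def hamming(x: set, y: set) -> int:
    """H(x,y) = |(x - y) ∪ (y - x)|, i.e. the size of the symmetric difference."""
    return len(x ^ y)


x = {"Lake", "Mendota", "Monona"}
y = {"Wisc", "Dane", "Mendota", "Lake"}
print(overlap(x, y), jaccard(x, y), hamming(x, y))
```

For weighted sets, each `len(...)` would be replaced by a sum of token weights; that variant is not shown here.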
Properties of sim()
Similarity functions can be rewritten into each other equivalently
  $J(x,y) > t \iff O(x,y) > \frac{t}{1+t}(|x|+|y|)$
  $O(x,y) > t \iff H(x,y) < |x|+|y|-2t$
  $C(x,y) > t \iff O(x,y) > t\sqrt{|x| \cdot |y|}$
E.g., x: {Lake, Mendota, Monona}, y: {Wisc, Dane, Mendota, Lake}
  J(x,y) > 0.5 ? is equivalent to O(x,y) > 2.3 ?
Set representation: k-gram, word, phrase, …
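The Jaccard-to-overlap rewriting follows from $|x \cup y| = |x| + |y| - |x \cap y|$. The short derivation below is a sketch of that step, followed by the numbers of the example (|x| = 3, |y| = 4, t = 0.5).

```latex
\begin{align*}
J(x,y) > t
  &\iff \frac{O(x,y)}{|x| + |y| - O(x,y)} > t \\
  &\iff O(x,y) > t\,\bigl(|x| + |y|\bigr) - t\,O(x,y) \\
  &\iff O(x,y) > \frac{t}{1+t}\,\bigl(|x| + |y|\bigr) \\[4pt]
\text{Example:}\quad
  \frac{0.5}{1 + 0.5}\,(3 + 4) &= \frac{7}{3} \approx 2.3
\end{align*}
```

Since the two sets in the example share only {Lake, Mendota}, O(x,y) = 2 and the pair does not pass either form of the test.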
Naïve Solution
All pair-wise comparison between A and B
Nested-loop: |A||B| comparisons
The sim() evaluation may be costly
  E.g., Generalized Jaccard similarity function with O(|x|^3)
For x in A:
  For y in B:
    If sim(x,y) > t, return (x,y);
A, B: table; x, y: record as set
Naïve Solution Example
A
ID | Content
1 | {Lake, Mendota}
2 | {Lake, Monona, Area}
3 | {Lake, Mendota, Monona, Dane}
B
ID | Content
4 | {Lake, Monona, University}
5 | {Monona, Research, Area}
6 | {Lake, Mendota, Monona, Area}
O(x,y) > 2 ?
O(x,y) | ID=4 | ID=5 | ID=6
ID=1 | 1 | 0 | 2
ID=2 | 2 | 2 | 3
ID=3 | 2 | 1 | 3
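The nested-loop join can be written directly in Python. The sketch below uses the example tables A and B; the dictionary layout (record ID mapped to a token set) and the function names are assumptions made for illustration.

```python
A = {
    1: {"Lake", "Mendota"},
    2: {"Lake", "Monona", "Area"},
    3: {"Lake", "Mendota", "Monona", "Dane"},
}
B = {
    4: {"Lake", "Monona", "University"},
    5: {"Monona", "Research", "Area"},
    6: {"Lake", "Mendota", "Monona", "Area"},
}


def overlap(x, y):
    return len(x & y)


def jaccard(x, y):
    return len(x & y) / len(x | y)


def naive_join(A, B, sim, t):
    """Compare every record of A with every record of B: |A||B| sim() calls."""
    results = []
    for xid, x in A.items():
        for yid, y in B.items():
            if sim(x, y) > t:
                results.append((xid, yid))
    return results


print(naive_join(A, B, overlap, 2))    # pairs with O(x,y) > 2
print(naive_join(A, B, jaccard, 0.6))  # pairs with J(x,y) > 0.6
```

Every pair is evaluated regardless of how dissimilar it is, which is the cost the filtering techniques on the following slides try to avoid.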
Naïve Solution Example
(Same tables A and B as in the previous example.)
J(x,y) > 0.6 ?
J(x,y) | ID=4 | ID=5 | ID=6
ID=1 | 0.25 | 0 | 0.5
ID=2 | 0.5 | 0.5 | 0.75
ID=3 | 0.4 | 0.17 | 0.6
2-Step Framework
Step 1: “Blocking”
  Using index/heuristics/filtering/etc., reduce # of candidates to compare
Step 2: sim() only within candidate sets
  O(|A||C|) s.t. |C| << |B|
For x in A:
  Using Foo, find a candidate set C in B
  For y in C:
    If sim(x,y) > t, return (x,y);
Variants for “Foo”
“Foo”: How to identify candidate set C
  Fast
  Accurate: no false positives/negatives
Many variants for “Foo”
  Inverted Index [Sarawagi et al, SIGMOD 04]
  Size filtering [Arasu et al, VLDB 06]
  Prefix Index [Chaudhuri et al, ICDE 06]
  Prefix + Inverted Index [Bayardo et al, WWW 07]
  Bound filtering [On et al, ICDE 07]
  Position Index [Xiao et al, WWW 08]
Inverted Index [Sarawagi et al, SIGMOD 04]
(Tables A and B as in the Naïve Solution Example.)
Inverted Index (IDX) for A
Token in A | ID List
Area | 2
Dane | 3
Lake | 1, 2, 3
Mendota | 1, 3
Monona | 2, 3
Inverted Index (IDX) for B
Token in B | ID List
Area | 5, 6
Lake | 4, 6
Mendota | 6
Monona | 4, 5, 6
Research | 5
University | 4
Inverted Index [Sarawagi et al, SIGMOD 04]
(Tables A and B as in the Naïve Solution Example.)
Inverted Index (IDX) for B
Token in B | ID List
Area | 5, 6
Lake | 4, 6
Mendota | 6
Monona | 4, 5, 6
Research | 5
University | 4
For x in A:
  Using IDX, find a candidate set C in B
  For y in C:
    If sim(x,y) > t, return (x,y);
ID=1: {Lake, Mendota}
  Candidate set C: {4, 6} + {6} = {4, 6}
ID=2: …
ID=3: …
Inverted Index [Sarawagi et al, SIGMOD 04]
(Tables A, B, and the Inverted Index (IDX) for B as on the previous slide.)
O(x,y) > 2
ID=1: {Lake, Mendota}
  Candidate set C, with the number of ID lists each candidate appears in:
  ID | Freq.
  4 | 1
  6 | 2
ID=2: …
ID=3: …
For x in A:
  Using IDX, find a candidate set C in B
  For y in C:
    If sim(x,y) > t, return (x,y);
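The inverted-index lookup and the frequency count can be sketched as follows. This is an illustrative sketch, not the algorithm of the cited paper: it assumes unweighted token sets, in which case the frequency of a candidate ID equals its overlap with x, and all names are hypothetical.

```python
from collections import Counter, defaultdict

A = {
    1: {"Lake", "Mendota"},
    2: {"Lake", "Monona", "Area"},
    3: {"Lake", "Mendota", "Monona", "Dane"},
}
B = {
    4: {"Lake", "Monona", "University"},
    5: {"Monona", "Research", "Area"},
    6: {"Lake", "Mendota", "Monona", "Area"},
}


def build_inverted_index(collection):
    """Map each token to the list of record IDs that contain it."""
    index = defaultdict(list)
    for rid, tokens in collection.items():
        for tok in tokens:
            index[tok].append(rid)
    return index


def candidate_counts(x, index):
    """Merge the ID lists of x's tokens and count how often each ID occurs."""
    counts = Counter()
    for tok in x:
        counts.update(index.get(tok, []))
    return counts


def index_join(A, B, t):
    """Overlap join: only IDs sharing at least one token with x are touched."""
    idx = build_inverted_index(B)
    return [
        (xid, yid)
        for xid, x in A.items()
        for yid, freq in sorted(candidate_counts(x, idx).items())
        if freq > t
    ]


idx_b = build_inverted_index(B)
print(dict(candidate_counts(A[1], idx_b)))  # frequencies for ID=1
print(index_join(A, B, 2))
```

Records of B that share no token with x (ID=5 for ID=1) never appear in the candidate set, so no similarity is computed for them.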
Size Filtering [Arasu et al, VLDB 06]
Idea: Build an index on the size of inputs
  Jaccard coefficient: $J = \frac{|x \cap y|}{|x \cup y|}$
  Upper bound for Jaccard: $\frac{|x \cap y|}{|x \cup y|} \le \frac{\min(|x|,|y|)}{\max(|x|,|y|)}$
  Bounding |y| w.r.t. |x|:
    If |y| <= |x|: $J \le \frac{|y|}{|x|}$
    If |y| >= |x|: $J \le \frac{|x|}{|y|}$
  Combining the two: $J \cdot |x| \le |y| \le \frac{|x|}{J}$
Size Filtering [Arasu et al, VLDB 06]
Intuition: If t and |x| are given, |y| is bounded
E.g., x: {Lake, Mendota}, y: {Lake, Mendota, Monona, Area}; J(x,y) > 0.8 ?
  Then, according to $J \cdot |x| \le |y| \le \frac{|x|}{J}$ with |x| = 2 and t = 0.8: 1.6 <= |y| <= 2.5
  However, |y| = 4, so y cannot satisfy t = 0.8: no need to compute J(x,y) at all
Size Filtering [Arasu et al, VLDB 06]
Algorithm
  For all input strings, build a B-tree w.r.t. their sizes
  Given a set x, using the B-tree index, find a candidate y in B s.t. $J \cdot |x| \le |y| \le \frac{|x|}{J}$
For x in A:
  Using IDX, find a candidate set C in B
  For y in C:
    If sim(x,y) > t, return (x,y);
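A size index can be approximated with a sorted list and binary search. The sketch below uses Python's `bisect` module as a stand-in for the B-tree (an assumption; the slides do not prescribe an implementation) and looks up every record whose size falls in [t·|x|, |x|/t].

```python
import bisect

B = {
    4: {"Lake", "Monona", "University"},
    5: {"Monona", "Research", "Area"},
    6: {"Lake", "Mendota", "Monona", "Area"},
}


def build_size_index(collection):
    """Return parallel lists (sizes, ids), sorted by set size."""
    entries = sorted((len(tokens), rid) for rid, tokens in collection.items())
    sizes = [size for size, _ in entries]
    ids = [rid for _, rid in entries]
    return sizes, ids


def size_candidates(x, sizes, ids, t):
    """IDs whose set size lies within t*|x| <= |y| <= |x|/t."""
    lo = bisect.bisect_left(sizes, t * len(x))
    hi = bisect.bisect_right(sizes, len(x) / t)
    return ids[lo:hi]


sizes, ids = build_size_index(B)
x = {"Lake", "Mendota"}
print(size_candidates(x, sizes, ids, 0.8))  # bounds 1.6 <= |y| <= 2.5
print(size_candidates(x, sizes, ids, 0.5))  # bounds 1.0 <= |y| <= 4.0
```

The filter only narrows the candidate set; each surviving candidate still has to be verified with the actual Jaccard computation.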
Prefix Index [Chaudhuri et al, ICDE 06]
Intuition: If two sets are very similar, their prefixes, when ordered, must have some common tokens
E.g., x: {Dane, University, Monona, Mendota}, y: {Area, Lake, Mendota, Monona, Wisc}; O(x,y) > 3 ?
  Ordered: x’: {Dane, Mendota, Monona, University}, y’: {Area, Lake, Mendota, Monona, Wisc}
  [Figure: the leading tokens of x’ and y’ are marked as their prefixes]
Prefix Index [Chaudhuri et al, ICDE 06]
Theorem 1: If there is no overlap btw. Prefix(x) and Prefix(y), then sim(x,y) > t cannot hold, where the prefix length is:
  If sim() = Overlap, |Prefix(x)| = |x| - (t-1)
  If sim() = Jaccard, |Prefix(x)| = |x| - Ceiling(t*|x|) + 1
Algorithm using Theorem 1:
  Given a set x
  For each token t_x in the prefix of x
    o Using an index, locate a candidate y that contains t_x in the prefix of y
    o If sim(x,y) > t, return (x,y)
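The prefix test of Theorem 1 can be sketched as below. The prefix-length formulas are the ones stated on the slide; the helper names and the use of alphabetical order as the global token order are assumptions made for this example.

```python
import math


def prefix_len_overlap(size: int, t: int) -> int:
    """Prefix length for overlap similarity: |x| - (t - 1)."""
    return max(size - (t - 1), 0)


def prefix_len_jaccard(size: int, t: float) -> int:
    """Prefix length for Jaccard similarity: |x| - ceil(t * |x|) + 1."""
    return size - math.ceil(t * size) + 1


def prefixes_share_token(x_sorted, y_sorted, px, py) -> bool:
    """True if the first px tokens of x and the first py tokens of y intersect."""
    return bool(set(x_sorted[:px]) & set(y_sorted[:py]))


# Both sets ordered by the same global (here: alphabetical) order.
x = sorted({"Dane", "University", "Monona", "Mendota"})
y = sorted({"Area", "Lake", "Mendota", "Monona", "Wisc"})
t = 3

px = prefix_len_overlap(len(x), t)  # 4 - 2 = 2 -> [Dane, Mendota]
py = prefix_len_overlap(len(y), t)  # 5 - 2 = 3 -> [Area, Lake, Mendota]
print(x[:px], y[:py], prefixes_share_token(x, y, px, py))
```

Here the prefixes share “Mendota”, so (x, y) stays a candidate and the actual similarity still has to be computed; pairs whose prefixes are disjoint are discarded without any sim() call.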
(Tables A and B as in the Naïve Solution Example.)
Inverted Index (IDX) for both A and B
Token | ID List | DF | Order
Area | 2, 5, 6 | 3 | 4
Dane | 3 | 1 | 1
Lake | 1, 2, 3, 4, 6 | 5 | 6
Mendota | 1, 3, 6 | 3 | 5
Monona | 2, 3, 4, 5, 6 | 5 | 7
Research | 5 | 1 | 2
University | 4 | 1 | 3
Prefix + Inverted Index [Bayardo et al, WWW 07]
Create a universal order: put rare tokens first
Order: Dane > Research > University > Area > Mendota > Lake > Monona
Ordered A
ID | Content
1 | {Mendota, Lake}
2 | {Area, Lake, Monona}
3 | {Dane, Mendota, Lake, Monona}
Ordered B
ID | Content
4 | {University, Lake, Monona}
5 | {Research, Area, Monona}
6 | {Area, Mendota, Lake, Monona}
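The universal order can be derived from document frequencies. In this sketch, ties between tokens with equal DF are broken alphabetically; that tie-breaking rule is an assumption, chosen because it reproduces the order shown on the slide.

```python
from collections import Counter

A = {
    1: {"Lake", "Mendota"},
    2: {"Lake", "Monona", "Area"},
    3: {"Lake", "Mendota", "Monona", "Dane"},
}
B = {
    4: {"Lake", "Monona", "University"},
    5: {"Monona", "Research", "Area"},
    6: {"Lake", "Mendota", "Monona", "Area"},
}


def universal_order(*collections):
    """Tokens sorted by ascending document frequency (rare tokens first)."""
    df = Counter()
    for collection in collections:
        for tokens in collection.values():
            df.update(tokens)  # each record is a set, so a token counts once
    return sorted(df, key=lambda tok: (df[tok], tok))


def reorder(tokens, order):
    """Return a record's tokens as a list sorted by the universal order."""
    rank = {tok: i for i, tok in enumerate(order)}
    return sorted(tokens, key=rank.__getitem__)


order = universal_order(A, B)
print(order)
print({rid: reorder(tokens, order) for rid, tokens in {**A, **B}.items()})
```

Putting rare tokens first keeps the ID lists of prefix tokens short, which is what makes the prefix-based candidate lookup selective.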
Prefix + Inverted Index [Bayardo et al, WWW 07]
Order: Dane > Research > University > Area > Mendota > Lake > Monona
(Ordered A and Ordered B as on the previous slide.)
O(x,y) > 2, Prefix(x) = |x|-(t-1) = |x|-1
Prefix Inverted Index for B
Token in B | ID List
Area | 5, 6
Lake | 4, 6
Mendota | 6
Research | 5
University | 4
ID=1: {Mendota, Lake}
  Candidate set C: {6}
ID=2: …
ID=3: …
Prefix + Inverted Index [Bayardo et al, WWW 07]
(Ordered A and Ordered B as before.)
O(x,y) > 2, Prefix(x) = |x|-(t-1) = |x|-1
Prefix Inverted Index for B
Token in B | ID List
Area | 5, 6
Lake | 4, 6
Mendota | 6
Research | 5
University | 4
ID=1: …
ID=2: {Area, Lake, Monona}
  Candidate set C: {5, 6} + {4, 6} = {4, 5, 6}
ID=3: …
Prefix + Inverted Index [Bayardo et al, WWW 07]
(Ordered A and Ordered B as before.)
O(x,y) > 2, Prefix(x) = |x|-(t-1) = |x|-1
Prefix Inverted Index for B
Token in B | ID List
Area | 5, 6
Lake | 4, 6
Mendota | 6
Research | 5
University | 4
ID=1: …
ID=2: …
ID=3: {Dane, Mendota, Lake, Monona}
  Candidate set C: {6} + {4, 6} = {4, 6}
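The three lookups above can be reproduced with a small prefix inverted index. This is a hedged sketch rather than the cited paper's algorithm: the universal order is hard-coded from the slide, the prefix length |x|-(t-1) with t = 2 follows the slide, and the function names are hypothetical.

```python
from collections import defaultdict

ORDER = ["Dane", "Research", "University", "Area", "Mendota", "Lake", "Monona"]
RANK = {tok: i for i, tok in enumerate(ORDER)}

A = {
    1: {"Lake", "Mendota"},
    2: {"Lake", "Monona", "Area"},
    3: {"Lake", "Mendota", "Monona", "Dane"},
}
B = {
    4: {"Lake", "Monona", "University"},
    5: {"Monona", "Research", "Area"},
    6: {"Lake", "Mendota", "Monona", "Area"},
}


def prefix(tokens, t):
    """First |x| - (t - 1) tokens of a record under the universal order."""
    ordered = sorted(tokens, key=RANK.__getitem__)
    return ordered[: max(len(ordered) - (t - 1), 0)]


def build_prefix_index(collection, t):
    """Inverted index over prefix tokens only."""
    index = defaultdict(set)
    for rid, tokens in collection.items():
        for tok in prefix(tokens, t):
            index[tok].add(rid)
    return index


def candidates(x, index, t):
    """Union of the ID lists of x's prefix tokens."""
    found = set()
    for tok in prefix(x, t):
        found |= index.get(tok, set())
    return found


T = 2
IDX_B = build_prefix_index(B, T)
for xid, x in A.items():
    print(xid, prefix(x, T), sorted(candidates(x, IDX_B, T)))
```

Because “Monona” is the most frequent token, it never enters any prefix here, so its long ID list is never scanned.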
Position Index [Xiao et al, WWW 08]
Order: Dane > Research > University > Area > Mendota > Lake > Monona
E.g., x: {Dane, Research, Area, Mendota, Lake}, y: {Research, Area, Mendota, Lake, Monona}; O(x,y) > 4 ?
  |Prefix(x)| = |Prefix(y)| = 5 – (4 – 1) = 2
  Prefix(x) = {Dane, Research}, Prefix(y) = {Research, Area}
  “Research” is common btw. the prefixes, so (x,y) is a candidate pair: need to compute sim(x,y)
Position Index [Xiao et al, WWW 08]
Order: Dane > Research > University > Area > Mendota > Lake > Monona
E.g., x: {Dane, Research, Area, Mendota, Lake}, y: {Research, Area, Mendota, Lake, Monona}; O(x,y) > 4 ?
  |Prefix(x)| = |Prefix(y)| = 5 – (4 – 1) = 2
  Estimation of max overlap = overlap in prefixes + min # of unseen tokens = 1 + min(3,4) = 4, which is not > t
  No need to compute sim(x,y)!
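The positional estimate can be sketched as a small function. It assumes both records are lists sorted by the same universal order and uses the position of the first shared prefix token; the function name and return convention are hypothetical.

```python
def overlap_upper_bound(x, y, px, py):
    """Upper bound on |x ∩ y| from the first token shared by the two prefixes.

    x, y   : token lists sorted by the same universal order
    px, py : prefix lengths of x and y
    Returns None when the prefixes share no token (the pair is not a candidate).
    """
    pos_in_y = {tok: j for j, tok in enumerate(y[:py])}
    for i, tok in enumerate(x[:px]):
        if tok in pos_in_y:
            j = pos_in_y[tok]
            unseen_x = len(x) - i - 1  # tokens of x after the match
            unseen_y = len(y) - j - 1  # tokens of y after the match
            return 1 + min(unseen_x, unseen_y)
    return None


x = ["Dane", "Research", "Area", "Mendota", "Lake"]
y = ["Research", "Area", "Mendota", "Lake", "Monona"]
t = 4

bound = overlap_upper_bound(x, y, 2, 2)
print(bound, "prune" if bound is None or not bound > t else "verify")
```

Tokens before the first shared token cannot match anything in the other record, because both lists follow the same order; that is why only the tokens after the match count toward the bound.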
Bound Filtering [On et al, ICDE 07]
Generalized Jaccard (GJ) similarity
  Two sets: x = {a_1, …, a_|x|}, y = {b_1, …, b_|y|}
  Normalized weight of the maximum bipartite matching M in the bipartite graph (N = x ∪ y, E = x × y)
  $GJ(x,y) = \frac{\sum_{(a_i,b_j) \in M} sim(a_i,b_j)}{|x| + |y| - |M|}$
Bound Filtering [On et al, ICDE 07]
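[Figure: bipartite graph between x (3 elements) and y (2 elements) with edge similarities 0.7, 0.5, 0.4, 0.9, 0.1, 0.2; M: maximum weight bipartite matching, highlighted]
$GJ(x,y) = \frac{\sum_{(a_i,b_j) \in M} sim(a_i,b_j)}{|x| + |y| - |M|} = \frac{0.9 + 0.7}{3 + 2 - 2} = \frac{1.6}{3} = 0.53$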
Bound Filtering [On et al, ICDE 07]
Issues
  GJ captures more semantics btw. two sets via the weighted bipartite matching than Jaccard
  But more costly to compute: maximum weight bipartite matching
    o Bellman-Ford: O(V^2 E)
    o Hungarian: O(V^3)
For x in A:
  Using Foo, find a candidate set C in B
  For y in C:
    If GJ(x,y) > t, return (x,y);
Bound Filtering [On et al, ICDE 07]
Bipartite matching computation is expensive because of the requirement
  No node in the bipartite graph can have more than one edge incident on it
Relax this constraint:
  For each element a_i in x, find an element b_j in y with the highest element-level similarity: S1
  For each element b_j in y, find an element a_i in x with the highest element-level similarity: S2
Complexity becomes linear: O(|x|+|y|)
Bound Filtering [On et al, ICDE 07]
[Figure: the bipartite graph between x and y (edge similarities 0.7, 0.5, 0.4, 0.9, 0.1, 0.2), shown with the edge sets S1 and S2 highlighted]
Bound Filtering [On et al, ICDE 07]
$GJ(x,y) = \frac{\sum_{(a_i,b_j) \in M} sim(a_i,b_j)}{|x| + |y| - |M|}$
$UB(x,y) = \frac{\sum_{(a_i,b_j) \in S_1 \cup S_2} sim(a_i,b_j)}{|x| + |y| - |S_1 \cup S_2|}$
$LB(x,y) = \frac{\sum_{(a_i,b_j) \in S_1 \cap S_2} sim(a_i,b_j)}{|x| + |y| - |S_1 \cap S_2|}$
Properties:
  Numerator of UB is at least as large as that of GJ
  Denominator of UB is no larger than that of GJ
  Similar arguments for LB
Theorem 2: LB <= GJ <= UB
Bound Filtering [On et al, ICDE 07]
Algorithm
  Compute UB(x,y)
  If UB(x,y) <= t, then GJ(x,y) <= t: (x,y) is not an answer
  Else compute LB(x,y)
    If LB(x,y) > t, then GJ(x,y) > t: (x,y) is an answer
    Else compute GJ(x,y)
For x in A:
  Using Foo, find a candidate set C in B
  For y in C:
    If GJ(x,y) > t, return (x,y);
LB <= GJ <= UB
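The bound-then-verify logic can be sketched as follows. Several things here are assumptions made for illustration: the element names and the assignment of the six edge weights to element pairs are hypothetical, the exact GJ is computed by brute force over all matchings (practical only for tiny sets, where a real system would use e.g. the Hungarian algorithm), both sets are non-empty, and every element pair has a non-negative similarity so that |M| = min(|x|, |y|).

```python
from itertools import permutations

# Hypothetical edge weights: sim[(a, b)] for a in x, b in y.
SIM = {
    ("a1", "b1"): 0.7, ("a1", "b2"): 0.5,
    ("a2", "b1"): 0.4, ("a2", "b2"): 0.9,
    ("a3", "b1"): 0.1, ("a3", "b2"): 0.2,
}
X = ["a1", "a2", "a3"]
Y = ["b1", "b2"]


def gj_exact(x, y, sim):
    """Brute-force maximum weight bipartite matching, normalized as GJ."""
    if len(x) <= len(y):
        matchings = (list(zip(x, perm)) for perm in permutations(y, len(x)))
    else:
        matchings = (list(zip(perm, y)) for perm in permutations(x, len(y)))
    best = max(sum(sim[edge] for edge in m) for m in matchings)
    return best / (len(x) + len(y) - min(len(x), len(y)))


def greedy_edges(x, y, sim):
    """S1: best partner for each a in x; S2: best partner for each b in y."""
    s1 = {(a, max(y, key=lambda b: sim[(a, b)])) for a in x}
    s2 = {(max(x, key=lambda a: sim[(a, b)]), b) for b in y}
    return s1, s2


def normalized(edges, x, y, sim):
    """Sum of edge weights divided by |x| + |y| - |edges|."""
    return sum(sim[e] for e in edges) / (len(x) + len(y) - len(edges))


def gj_above(x, y, sim, t):
    """Decide GJ(x,y) > t, computing the exact GJ only if the bounds are inconclusive."""
    s1, s2 = greedy_edges(x, y, sim)
    if normalized(s1 | s2, x, y, sim) <= t:  # UB <= t: not an answer
        return False
    if normalized(s1 & s2, x, y, sim) > t:   # LB > t: an answer
        return True
    return gj_exact(x, y, sim) > t


S1, S2 = greedy_edges(X, Y, SIM)
LB = normalized(S1 & S2, X, Y, SIM)
UB = normalized(S1 | S2, X, Y, SIM)
GJ = gj_exact(X, Y, SIM)
print(round(LB, 2), round(GJ, 2), round(UB, 2))
```

With these assumed weights the matching {(a1, b1), (a2, b2)} has weight 1.6, which agrees with the 0.53 computed on the earlier slide, and the exact matching is needed only for thresholds that fall between LB and UB.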
Takeaways
WESSJoin finds ALL pairs of sets btw. two collections whose similarity > t
  Good abstraction for various problems
2-step framework is promising
  Step 1: reduce candidates
  Step 2: similarity computation among candidates
Less researched issues
  Comparison among different WESSJoin methods
  WESSJoin + top-k/skyline/MapReduce/etc.
Reference
[Sarawagi et al, SIGMOD 04] Sunita Sarawagi, Alok Kirpal: Efficient set joins on similarity predicates. SIGMOD 2004.
[Arasu et al, VLDB 06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient exact set-similarity joins. VLDB 2006.
[Chaudhuri et al, ICDE 06] Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006.
[Bayardo et al, WWW 07] R. J. Bayardo, Yiming Ma, Ramakrishnan Srikant: Scaling Up All-Pairs Similarity Search. WWW 2007.
[On et al, ICDE 07] Byung-Won On, Nick Koudas, Dongwon Lee, Divesh Srivastava: Group Linkage. ICDE 2007.
[Xiao et al, WWW 08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
Wei Wang. Efficient Exact Similarity Join Algorithms: http://www.cse.unsw.edu.au/~weiw/project/PPJoin-UTS-Oct-2008.pdf
Jeffrey D. Ullman. High-Similarity Algorithms: http://infolab.stanford.edu/~ullman/mining/2009/similarity4.pdf