iSchool, Cloud Computing Class Talk, Oct 6, 2008
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective

Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
Overview
- Abstract Problem
- Trivial Solution
- MapReduce Solution
- Efficiency Tricks
Abstract Problem
[Figure: a collection of documents mapped to a matrix of pairwise similarity scores]
Applications: clustering, coreference resolution, "more-like-this" queries
Similarity of Documents
- Simple inner product
- Cosine similarity
- Term weights: a standard problem in IR (tf-idf, BM25, etc.)

[Figure: document vectors d_i and d_j]
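Spelling out the inner product: with w_{t,d} the weight of term t in document d, and V the vocabulary,

```latex
\mathrm{sim}(d_i, d_j) \;=\; \sum_{t \in V} w_{t,d_i} \, w_{t,d_j}
```

Only terms with nonzero weight in both documents contribute to the sum, which is what the later decomposition exploits.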
Trivial Solution
- Load each vector o(N) times
- Load each term o(df_t²) times

Goal: a scalable and efficient solution for large collections
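A minimal sketch of the trivial approach in Python (function and variable names are mine, not from the talk): every document vector is compared against every other, so each vector is read o(N) times.

```python
# Trivial O(N^2) pairwise similarity: compare every pair of
# document vectors directly (illustrative sketch).

def pairwise_similarity(vectors):
    """vectors: {doc_id: {term: weight}} -> {(di, dj): score}"""
    sims = {}
    ids = sorted(vectors)
    for i, di in enumerate(ids):
        for dj in ids[i + 1:]:  # each vector is loaded o(N) times
            score = sum(w * vectors[dj].get(t, 0.0)
                        for t, w in vectors[di].items())
            if score > 0.0:
                sims[(di, dj)] = score
    return sims

# Toy collection with raw term counts as weights:
docs = {
    "d1": {"clinton": 2.0, "obama": 1.0},
    "d2": {"clinton": 1.0, "cheney": 1.0},
    "d3": {"clinton": 1.0, "barack": 1.0, "obama": 1.0},
}
print(pairwise_similarity(docs))
```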
Better Solution
- Load the weights for each term only once
- Each term contributes o(df_t²) partial scores
- Each term contributes only if it appears in both documents
- Allows efficiency tricks
Decomposition → MapReduce

- Load the weights for each term once → the index step
- Each term contributes o(df_t²) partial scores, only for documents it appears in → the map step
- Summing partial scores per document pair → the reduce step
MapReduce Framework
[Figure: inputs flow through parallel map tasks; a shuffle stage groups intermediate values by key; parallel reduce tasks produce the outputs]

(a) Map: (k1, v1) → [(k2, v2)]
(b) Shuffle: group values by key, yielding (k2, [v2])
(c) Reduce: (k2, [v2]) → [(k3, v3)]

The framework handles the low-level details transparently.
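The abstract flow can be simulated in a single process; this sketch (illustrative names, not a real framework API) runs word count through the three stages:

```python
from collections import defaultdict

# Minimal in-process simulation of the map -> shuffle -> reduce flow.

def run_mapreduce(inputs, mapper, reducer):
    # (a) Map: (k1, v1) -> [(k2, v2)]
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: group values by key -> (k2, [v2])
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: (k2, [v2]) -> [(k3, v3)]
    output = []
    for k2, values in sorted(groups.items()):
        output.extend(reducer(k2, values))
    return output

# Word count as the canonical example:
docs = [("d1", "clinton obama clinton"), ("d2", "clinton cheney")]
counts = run_mapreduce(
    docs,
    mapper=lambda doc_id, text: [(w, 1) for w in text.split()],
    reducer=lambda word, ones: [(word, sum(ones))],
)
print(counts)  # [('cheney', 1), ('clinton', 3), ('obama', 1)]
```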
Standard Indexing
[Figure: docs flow through parallel tokenize (map) tasks; the shuffle groups values by term; combine (reduce) tasks emit one posting list per term]

(a) Map: tokenize each doc, emitting its terms
(b) Shuffle: group values by term
(c) Reduce: combine each term's values into a posting list
Indexing (3-doc toy collection)
Documents:
- d1: "Clinton Obama Clinton"
- d2: "Clinton Cheney"
- d3: "Clinton Barack Obama"

Resulting postings (term → list of (doc, tf)):
- Clinton → (d1, 2), (d2, 1), (d3, 1)
- Barack → (d3, 1)
- Cheney → (d2, 1)
- Obama → (d1, 1), (d3, 1)
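A sketch of the indexing job on this toy collection, with map and reduce simulated in one process (illustrative code, not the authors' implementation):

```python
from collections import Counter, defaultdict

docs = {
    "d1": "Clinton Obama Clinton",
    "d2": "Clinton Cheney",
    "d3": "Clinton Barack Obama",
}

# Map ("tokenize"): emit (term, (doc_id, tf)) for each document.
emitted = []
for doc_id, text in docs.items():
    for term, tf in Counter(text.lower().split()).items():
        emitted.append((term, (doc_id, tf)))

# Shuffle groups by term; Reduce ("combine") builds each posting list.
index = defaultdict(list)
for term, posting in emitted:
    index[term].append(posting)
postings = {t: sorted(ps) for t, ps in index.items()}

print(postings["clinton"])  # [('d1', 2), ('d2', 1), ('d3', 1)]
```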
Pairwise Similarity
(a) Generate pairs  (b) Group pairs  (c) Sum pairs

From the postings above:
- Clinton → (d1, 2), (d2, 1), (d3, 1) yields partial scores (d1, d2): 2, (d1, d3): 2, (d2, d3): 1
- Obama → (d1, 1), (d3, 1) yields (d1, d3): 1
- Barack and Cheney each appear in a single document and yield no pairs

Grouping by pair and summing: (d1, d2) = 2, (d1, d3) = 2 + 1 = 3, (d2, d3) = 1
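The same result falls out of a short sketch (illustrative code): generate a partial score for each pair of documents in each posting list, then group and sum.

```python
from collections import defaultdict
from itertools import combinations

postings = {
    "clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "barack":  [("d3", 1)],
    "cheney":  [("d2", 1)],
    "obama":   [("d1", 1), ("d3", 1)],
}

# (a) Generate pairs and (c) sum them; grouping (b) is implicit
# in accumulating into the same dict key.
similarity = defaultdict(int)
for term, plist in postings.items():
    for (di, wi), (dj, wj) in combinations(plist, 2):
        similarity[(di, dj)] += wi * wj  # partial score for this term

print(dict(similarity))  # (d1,d2)=2, (d1,d3)=3, (d2,d3)=1
```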
Pairwise Similarity (abstract)

[Figure: term postings flow through parallel multiply (map) tasks; the shuffle groups partial scores by document pair; sum (reduce) tasks emit the similarities]

(a) Generate pairs: map multiplies weights for every pair of documents in each posting list
(b) Group pairs: shuffle groups partial scores by document pair
(c) Sum pairs: reduce sums the partial scores into the final similarity score
Experimental Setup
- Hadoop 0.16.0, an open-source MapReduce implementation
- Cluster of 19 machines, each w/ two processors (single core)
- AQUAINT-2 collection: 906K documents
- Okapi BM25 term weights; experiments on subsets of the collection

Elsayed, Lin, and Oard, ACL 2008
Efficiency (disk space)
[Figure: intermediate pairs (billions) as a function of corpus size (%); roughly 8 trillion intermediate pairs for the full collection]

Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk
AQUAINT-2 collection, ~906K docs
Terms: Zipfian Distribution
[Figure: log-log plot of doc freq (df) vs. term rank]

Each term t contributes o(df_t²) partial results, so very few terms dominate the computation:
- most frequent term ("said"): 3%
- most frequent 10 terms: 15%
- most frequent 100 terms: 57%
- most frequent 1,000 terms: 95%

Those 1,000 terms are ~0.1% of all terms (a 99.9% df-cut).
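A df-cut could be implemented along these lines (a hypothetical sketch: the talk picks an absolute df via a histogram, while this version simply sorts terms by df; all names are mine):

```python
# Drop the most frequent fraction of terms, which dominate the
# o(df^2) cost of pair generation.

def df_cut(dfs, cut=0.999):
    """dfs: {term: document frequency}. Return the set of terms kept
    by a `cut` df-cut (e.g. 0.999 keeps the 99.9% least-frequent
    terms and drops the top 0.1%)."""
    ranked = sorted(dfs, key=dfs.get)   # ascending df
    keep = int(len(ranked) * cut)
    return set(ranked[:keep])

dfs = {"said": 500_000, "clinton": 40_000, "barack": 5_000, "zyzzyva": 2}
print(df_cut(dfs, cut=0.75))  # drops the single most frequent term
```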
Efficiency (disk space)
[Figure: intermediate pairs (billions) vs. corpus size (%), one curve each for no df-cut and df-cuts at 99.999%, 99.99%, 99.9%, and 99%; 8 trillion intermediate pairs with no df-cut vs. 0.5 trillion at the 99.9% df-cut]

Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk
AQUAINT-2 collection, ~906K docs
Effectiveness (recent work)

Effect of the df-cut on effectiveness:

[Figure: relative P5 (%) vs. df-cut (%) from 99.00 to 100.00; Medline04, 909K abstracts, ad-hoc retrieval]

Dropping 0.1% of terms yields "near-linear" growth and lets the intermediate results fit on disk, at a cost of 2% in effectiveness.

Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk
Implementation Issues
BM25 similarity model: tf, idf, and document length

df-cut: build a histogram of df values, then pick the absolute df that corresponds to the chosen % df-cut
The BM25 term weight:

$$ w_{t,d} \;=\; \frac{tf_{t,d}}{0.5 + 1.5 \cdot \frac{dl}{avgdl} + tf_{t,d}} \;\cdot\; \log\frac{N - df_t + 0.5}{df_t + 0.5} $$
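A sketch of this weight as a function (names are mine; the 0.5 and 1.5 constants follow the simplified Okapi form on the slide):

```python
import math

def bm25_weight(tf, df, dl, avgdl, N):
    """BM25-style weight of a term: tf saturation normalized by
    document length, times an idf factor."""
    tf_part = tf / (0.5 + 1.5 * (dl / avgdl) + tf)
    idf_part = math.log((N - df + 0.5) / (df + 0.5))
    return tf_part * idf_part

w = bm25_weight(tf=2, df=3, dl=10, avgdl=12, N=906_000)
```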
Other Approximation Techniques?
Other Approximation Techniques
(2) Absolute df
Consider only terms that appear in at least n (or some %) of the documents: an absolute lower bound on df, instead of just removing the most-frequent % of terms
Other Approximation Techniques
(3) tf-Cut
Consider only documents (in the posting list) with tf > T, e.g. T = 1 or 2
Or: consider only the top N documents by tf for each term
Other Approximation Techniques
(4) Similarity Threshold
Consider only partial scores > SimT
Other Approximation Techniques:
(5) Ranked List
Keep only the N most similar documents, computed in the reduce phase
Good for ad-hoc retrieval and "more-like-this" queries
Space-Saving Tricks
(1) Stripes
Emit stripes instead of pairs: group partial scores by doc id, not by document pair
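The difference between the two intermediate formats can be sketched as follows (illustrative code, not the authors' implementation):

```python
from collections import defaultdict

plist = [("d1", 2), ("d2", 1), ("d3", 1)]  # one posting list

# Pairs: one key-value record per document pair.
pairs = [((di, dj), wi * wj)
         for i, (di, wi) in enumerate(plist)
         for dj, wj in plist[i + 1:]]

# Stripes: one record per left document carrying a whole row of
# partial scores -- fewer, larger records to shuffle and sort.
stripes = defaultdict(dict)
for (di, dj), w in pairs:
    stripes[di][dj] = w

print(pairs)          # three pair records
print(dict(stripes))  # {'d1': {'d2': 2, 'd3': 2}, 'd2': {'d3': 1}}
```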
Space-Saving Tricks
(2) Blocking
- No need to generate the whole similarity matrix at once
- Generate different blocks of the matrix in different steps, limiting the maximum space required for intermediate results

[Figure: similarity matrix partitioned into blocks]