iSchool, Cloud Computing Class Talk, Oct 6th 2008

Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective

Tamer Elsayed, Jimmy Lin, and Douglas W. Oard

Overview

Abstract Problem
Trivial Solution
MapReduce Solution
Efficiency Tricks

Abstract Problem

[Figure: the documents of a collection and example pairwise similarity scores (0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74)]

Applications: clustering, coreference resolution, “more-like-that” queries

Similarity of Documents

Simple inner product
Cosine similarity
Term weights: a standard problem in IR (tf-idf, BM25, etc.)

[Figure: document vectors d_i and d_j]
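
The two similarity measures named on this slide can be sketched as follows. This is a minimal illustration, assuming each document is a sparse mapping from term to weight; the function names and the example weights are hypothetical and not taken from the talk.

```python
import math

def inner_product(di, dj):
    """Sum of w(t, di) * w(t, dj) over the terms present in both documents."""
    # Iterate over the shorter vector; a term missing on either side adds 0.
    if len(di) > len(dj):
        di, dj = dj, di
    return sum(w * dj[t] for t, w in di.items() if t in dj)

def cosine(di, dj):
    """Inner product normalized by the lengths of the two vectors."""
    norm_i = math.sqrt(sum(w * w for w in di.values()))
    norm_j = math.sqrt(sum(w * w for w in dj.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return inner_product(di, dj) / (norm_i * norm_j)

# Hypothetical term weights (raw term frequencies used as weights).
d_i = {"clinton": 2, "obama": 1}
d_j = {"clinton": 1, "barack": 1, "obama": 1}
print(inner_product(d_i, d_j))
print(round(cosine(d_i, d_j), 3))
```

Any term weighting (tf-idf, BM25, etc.) can be plugged in by changing how the weight dictionaries are built.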

Trivial Solution

Load each vector O(N) times
Load each term O(df_t^2) times

Goal: a scalable and efficient solution for large collections

Better Solution

Load weights for each term once
Each term contributes O(df_t^2) partial scores
Allows efficiency tricks

Each term contributes only if it appears in both d_i and d_j

Decomposition MapReduce

Load weights for each term once
Each term contributes O(df_t^2) partial scores

Each term contributes only if it appears in both d_i and d_j

[Diagram labels: index, map, reduce]

MapReduce Framework

[Figure: MapReduce data flow from inputs to outputs in three stages: (a) Map, (b) Shuffle (group values by keys), (c) Reduce]

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]

The framework handles low-level details transparently
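
The three stages can be imitated in memory with a few lines of Python. This is only a sketch of the programming model, not of Hadoop itself: `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names, and word counting is used purely as a familiar example.

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    """Run map, shuffle, and reduce in memory over (k1, v1) inputs."""
    # (a) Map: each (k1, v1) yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: group values by key, giving (k2, [v2]).
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2]) yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def wc_map(doc_id, text):
    for word in text.split():
        yield word, 1

def wc_reduce(word, counts):
    yield word, sum(counts)

docs = [("d1", "a b a"), ("d2", "b c")]
print(run_mapreduce(docs, wc_map, wc_reduce))
```

A real framework distributes the map and reduce calls across machines and performs the shuffle over the network; the user still writes only the two functions.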

Standard Indexing

[Figure: indexing as a MapReduce job. (a) Map: tokenize each doc; (b) Shuffle: group values by terms; (c) Reduce: combine into one posting list per term]

Indexing (3-doc toy collection)

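Documents (labelled d1 to d3 here in order of appearance):
d1: Clinton Obama Clinton
d2: Clinton Cheney
d3: Clinton Barack Obama

Index after indexing (term: postings as (doc, tf)):
Clinton: (d1, 2), (d2, 1), (d3, 1)
Barack: (d3, 1)
Cheney: (d2, 1)
Obama: (d1, 1), (d3, 1)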
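
The toy index can be reproduced with a short sketch that follows the tokenize, group-by-term, combine steps. The document ids `d1` to `d3` and the function names are assumptions made for illustration; whitespace tokenization stands in for a real tokenizer.

```python
from collections import Counter, defaultdict

docs = {
    "d1": "Clinton Obama Clinton",
    "d2": "Clinton Cheney",
    "d3": "Clinton Barack Obama",
}

def index_map(doc_id, text):
    """Map: tokenize one document and emit (term, (doc_id, tf))."""
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)

def build_index(docs):
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            postings[term].append(posting)  # shuffle: group values by term
    # Reduce: combine the grouped postings into a sorted posting list.
    return {term: sorted(plist) for term, plist in postings.items()}

index = build_index(docs)
for term in sorted(index):
    print(term, index[term])
```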

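Pairwise Similarity

Index of the 3-doc toy collection (d1 = Clinton Obama Clinton, d2 = Clinton Cheney, d3 = Clinton Barack Obama):
Clinton: (d1, 2), (d2, 1), (d3, 1)
Barack: (d3, 1)
Cheney: (d2, 1)
Obama: (d1, 1), (d3, 1)

(a) Generate pairs
Clinton: (d1, d2) 2, (d1, d3) 2, (d2, d3) 1
Obama: (d1, d3) 1

(b) Group pairs
(d1, d2): [2]
(d1, d3): [2, 1]
(d2, d3): [1]

(c) Sum pairs
(d1, d2): 2
(d1, d3): 3
(d2, d3): 1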
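
The generate, group, and sum steps can be sketched over the toy index as follows. The document ids `d1` to `d3` (Clinton Obama Clinton, Clinton Cheney, Clinton Barack Obama) and the function names are assumptions for illustration, and raw term frequencies serve as term weights.

```python
from collections import defaultdict
from itertools import combinations

index = {
    "Clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "Barack": [("d3", 1)],
    "Cheney": [("d2", 1)],
    "Obama": [("d1", 1), ("d3", 1)],
}

def pairs_map(term, postings):
    """(a) Generate pairs: one partial score per document pair in a posting list."""
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def pairwise_similarity(index):
    grouped = defaultdict(list)
    for term, postings in index.items():
        for pair, partial in pairs_map(term, postings):
            grouped[pair].append(partial)  # (b) Group pairs
    # (c) Sum pairs: add up the partial scores of each document pair.
    return {pair: sum(partials) for pair, partials in grouped.items()}

sims = pairwise_similarity(index)
for pair in sorted(sims):
    print(pair, sims[pair])
```

Terms that occur in a single document (Barack, Cheney) emit nothing, so only terms shared by two documents contribute to a score.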

Pairwise Similarity (abstract)

[Figure: pairwise similarity as a MapReduce job. (a) Generate pairs: multiply the weights within each term's postings; (b) Group pairs: shuffle, grouping values by pairs; (c) Sum pairs: sum the partial scores into one similarity per pair]

Experimental Setup

Hadoop 0.16.0: open source MapReduce implementation
Cluster of 19 machines, each w/ two single-core processors
Aquaint-2 collection: 906K documents
Okapi BM25
Subsets of collection

Elsayed, Lin, and Oard, ACL 2008

Efficiency (disk space)

[Figure: intermediate pairs (billions, 0 to 9,000) vs. corpus size (%, 0 to 100); annotation: 8 trillion intermediate pairs]

Hadoop, 19 PCs, each w/ 2 single-core processors, 4GB memory, 100GB disk

Aquaint-2 Collection, ~906k docs

Terms: Zipfian Distribution

[Figure: doc freq (df) vs. term rank]

Each term t contributes O(df_t^2) partial results
Very few terms dominate the computations:
most frequent term (“said”): 3%
most frequent 10 terms: 15%
most frequent 100 terms: 57%
most frequent 1000 terms: 95%
~0.1% of total terms (99.9% df-cut)
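
How strongly a few terms dominate can be checked on any df list with a sketch like the one below. It assumes a term with document frequency df generates df(df - 1)/2 unordered pairs, and it uses synthetic Zipf-like document frequencies, so the printed shares will not match the Aquaint-2 figures above. The function names are hypothetical.

```python
def pair_count(df):
    """Unordered document pairs generated by a term with document frequency df."""
    return df * (df - 1) // 2

def share_of_top_terms(dfs, k):
    """Fraction of all intermediate pairs produced by the k most frequent terms."""
    ranked = sorted(dfs, reverse=True)
    total = sum(pair_count(df) for df in ranked)
    top = sum(pair_count(df) for df in ranked[:k])
    return top / total if total else 0.0

# Synthetic data: the term of rank r has df of roughly 100000 / r.
dfs = [100_000 // r for r in range(1, 10_001)]
for k in (1, 10, 100, 1000):
    print(k, round(share_of_top_terms(dfs, k), 3))
```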

Efficiency (disk space)

[Figure: intermediate pairs (billions, 0 to 9,000) vs. corpus size (%, 0 to 100), with one curve each for no df-cut and df-cut at 99.999%, 99.99%, 99.9%, and 99%; annotations: 8 trillion intermediate pairs, 0.5 trillion intermediate pairs]

Hadoop, 19 PCs, each w/ 2 single-core processors, 4GB memory, 100GB disk

Aquaint-2 Collection, ~906k docs

Effectiveness (recent work)

Effect of df-cut on effectiveness
Medline04: 909k abstracts, ad-hoc retrieval

[Figure: relative P5 (%, 50 to 100) vs. df-cut (%, 99.00 to 100.00)]

Drop 0.1% of terms
“Near-linear” growth
Fit on disk
Cost 2% in effectiveness

Hadoop, 19 PCs, each w/ 2 single-core processors, 4GB memory, 100GB disk

Implementation Issues

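BM25s Similarity Model
TF, IDF
Document length

\[
\frac{tf_{t,d_i}}{0.5 + 1.5\,\frac{dl_{d_i}}{avgdl} + tf_{t,d_i}}
\cdot
\frac{tf_{t,d_j}}{0.5 + 1.5\,\frac{dl_{d_j}}{avgdl} + tf_{t,d_j}}
\cdot
\log\frac{N - df_t + 0.5}{df_t + 0.5}
\]

DF-Cut
Build a histogram
Pick the absolute df for the % df-cut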
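
The df-cut steps (build a histogram, then pick the absolute df that corresponds to the percentage) might look like the sketch below. It assumes that a p% df-cut keeps the p% least frequent terms and drops the rest, and that terms sharing a df value are kept or dropped together; `df_threshold` is a hypothetical name.

```python
from collections import Counter

def df_threshold(dfs, cut_percent):
    """Largest df that is kept when the cut_percent least frequent terms survive."""
    histogram = Counter(dfs)  # df value -> number of terms with that df
    keep = int(len(dfs) * cut_percent / 100.0)
    kept = 0
    threshold = 0
    for df in sorted(histogram):
        # Stop before the bucket that would exceed the number of terms to keep.
        if kept + histogram[df] > keep:
            break
        kept += histogram[df]
        threshold = df
    return threshold

dfs = [1, 1, 2, 3, 10]           # made-up document frequencies
print(df_threshold(dfs, 80))     # terms with df above this value are dropped
```

With the threshold in hand, the posting lists longer than it can be skipped when generating pairs.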

Other Approximation Techniques?

Other Approximation Techniques

(2) Absolute df

Consider only terms that appear in at least n (or %) documents
An absolute lower bound on df, instead of just removing the % most-frequent terms

Other Approximation Techniques

(3) tf-Cut

Consider only documents (in posting list) with tf > T, where T = 1 or 2

OR: Consider only the top N documents based on tf for each term
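
Both tf-cut variants amount to pruning a posting list before pairs are generated. A minimal sketch, assuming postings are (doc, tf) tuples; the function names and the sample posting list are hypothetical.

```python
import heapq

def tf_cut(postings, min_tf):
    """Keep only postings whose tf is strictly greater than min_tf."""
    return [(doc, tf) for doc, tf in postings if tf > min_tf]

def top_n_by_tf(postings, n):
    """Keep only the n postings with the highest tf."""
    return heapq.nlargest(n, postings, key=lambda posting: posting[1])

postings = [("d1", 2), ("d2", 1), ("d3", 1), ("d4", 5)]
print(tf_cut(postings, 1))
print(top_n_by_tf(postings, 2))
```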

Other Approximation Techniques

(4) Similarity Threshold

Consider only partial scores > SimT
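
As a sketch, the threshold is a filter applied to the partial scores before they are grouped by pair. The name `threshold_partials` and the sample values are hypothetical.

```python
def threshold_partials(partials, sim_t):
    """Drop (pair, partial score) items whose score does not exceed sim_t."""
    return [(pair, score) for pair, score in partials if score > sim_t]

partials = [(("d1", "d2"), 2), (("d1", "d3"), 2), (("d2", "d3"), 1)]
print(threshold_partials(partials, 1))
```

Dropping small partial scores reduces the intermediate data, at the price of underestimating the similarity of pairs that share many low-weight terms.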

Other Approximation Techniques

(5) Ranked List

Keep only the most similar N documents
In the reduce phase

Good for ad-hoc retrieval and “more-like-this” queries
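
Keeping a ranked list per document can be sketched with a bounded selection over each document's candidate scores. The example assumes the similarities are already keyed by document pair; `top_n_similar` and the sample scores are hypothetical.

```python
import heapq
from collections import defaultdict

def top_n_similar(sims, n):
    """For each document, keep only its n most similar documents."""
    candidates = defaultdict(list)
    for (di, dj), score in sims.items():
        candidates[di].append((score, dj))
        candidates[dj].append((score, di))
    return {
        doc: [(other, score) for score, other in heapq.nlargest(n, cands)]
        for doc, cands in candidates.items()
    }

sims = {("d1", "d2"): 2, ("d1", "d3"): 3, ("d2", "d3"): 1}
ranked = top_n_similar(sims, 1)
print(ranked)
```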

Space-Saving Tricks

(1) Stripes

Stripes instead of pairs
Group by doc-id, not pairs

[Figure: partial scores grouped into stripes]
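
A stripes variant of the pairwise computation could be sketched as below: each map output is keyed by a single doc-id and carries a small dictionary (the stripe) of partial scores against other documents, and the reducer merges stripes element-wise. The toy index, the ids `d1` to `d3`, and the function names are assumptions for illustration.

```python
from collections import defaultdict

index = {
    "Clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "Barack": [("d3", 1)],
    "Cheney": [("d2", 1)],
    "Obama": [("d1", 1), ("d3", 1)],
}

def stripes_map(term, postings):
    """Emit one stripe per document: {later doc: partial score}."""
    ordered = sorted(postings)
    for pos, (di, wi) in enumerate(ordered):
        stripe = {dj: wi * wj for dj, wj in ordered[pos + 1:]}
        if stripe:
            yield di, stripe

def stripes_reduce(stripes):
    """Merge the stripes of one document by element-wise sum."""
    total = defaultdict(int)
    for stripe in stripes:
        for dj, partial in stripe.items():
            total[dj] += partial
    return dict(total)

def similarity_stripes(index):
    grouped = defaultdict(list)
    for term, postings in index.items():
        for di, stripe in stripes_map(term, postings):
            grouped[di].append(stripe)  # shuffle: group by doc-id, not by pair
    return {di: stripes_reduce(stripes) for di, stripes in grouped.items()}

result = similarity_stripes(index)
print(result)
```

Fewer, larger intermediate records are emitted than with one record per pair, since the key `di` is written once per stripe.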

Space-Saving Tricks

(2) Blocking

No need to generate the whole matrix at once
Generate different blocks of the matrix at different steps
Limit the max space required for intermediate results

[Figure: similarity matrix divided into blocks]
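
One simple way to realize blocking is to compute one band of rows of the similarity matrix per step, so that only that band's partial scores are held at a time. The sketch below is one possible scheme under that assumption, not necessarily the one used by the authors; the toy index and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

index = {
    "Clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "Barack": [("d3", 1)],
    "Cheney": [("d2", 1)],
    "Obama": [("d1", 1), ("d3", 1)],
}

def block_similarity(index, row_block):
    """Scores for the pairs whose first document falls in row_block."""
    scores = defaultdict(int)
    for postings in index.values():
        for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
            if di in row_block:
                scores[(di, dj)] += wi * wj
    return dict(scores)

def blocked_similarity(index, doc_ids, block_size):
    """Yield the similarity matrix one block of rows at a time."""
    ordered = sorted(doc_ids)
    for start in range(0, len(ordered), block_size):
        yield block_similarity(index, set(ordered[start:start + block_size]))

blocks = list(blocked_similarity(index, ["d1", "d2", "d3"], 1))
print(blocks)
```

This version rescans every posting list for each block, trading extra passes over the index for a bound on the intermediate results held at once.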
