iSchool, Cloud Computing Class Talk, Oct 6th 2008

Computing Pairwise Document Similarity in Large Collections:

A MapReduce Perspective

Tamer Elsayed, Jimmy Lin, and Douglas W. Oard


Overview

Abstract Problem
Trivial Solution
MapReduce Solution
Efficiency Tricks

Abstract Problem

[Figure: a collection of documents and their pairwise similarity scores: 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74]

Applications:
Clustering
Coreference resolution
“more-like-that” queries

Similarity of Documents

Simple inner product
Cosine similarity
Term weights: standard problem in IR (tf-idf, BM25, etc.)
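As a minimal sketch of the two similarity measures named on this slide, assume each document is a dict mapping terms to weights. The function names and the sample weights are illustrative and do not come from the talk.

```python
import math

def inner_product(doc_a, doc_b):
    """Sum of weight products over the terms the two documents share."""
    # Iterate over the smaller vector; terms missing from the other add 0.
    if len(doc_a) > len(doc_b):
        doc_a, doc_b = doc_b, doc_a
    return sum(w * doc_b[t] for t, w in doc_a.items() if t in doc_b)

def cosine(doc_a, doc_b):
    """Inner product normalized by the lengths of the two vectors."""
    norm_a = math.sqrt(sum(w * w for w in doc_a.values()))
    norm_b = math.sqrt(sum(w * w for w in doc_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return inner_product(doc_a, doc_b) / (norm_a * norm_b)

# Hypothetical term weights (plain term counts).
d1 = {"clinton": 2.0, "obama": 1.0}
d2 = {"clinton": 1.0, "cheney": 1.0}
print(inner_product(d1, d2), cosine(d1, d2))
```

Any term weighting scheme (tf-idf, BM25, and so on) can be plugged in by changing how the dict values are computed.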

Trivial Solution

Load each vector O(N) times
Load each term O(df_t^2) times

Scalable and efficient solution for large collections
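The trivial solution can be sketched as a double loop over all document pairs. This is an illustration under the assumption that every vector fits in memory as a dict of term weights; the function name and the three sample vectors are hypothetical.

```python
from itertools import combinations

def all_pairs_bruteforce(vectors):
    """Compare every pair of documents directly: O(N^2) vector comparisons."""
    scores = {}
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(vectors.items()), 2):
        # Each vector is touched once per partner document, i.e. O(N) times.
        score = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
        if score:
            scores[(id_a, id_b)] = score
    return scores

# Hypothetical vectors keyed by document id (weights are term counts).
vectors = {
    1: {"clinton": 2, "obama": 1},
    2: {"clinton": 1, "cheney": 1},
    3: {"clinton": 1, "barack": 1, "obama": 1},
}
print(all_pairs_bruteforce(vectors))
```

The cost grows with the square of the collection size regardless of how few terms two documents share, which is what motivates the term-at-a-time decomposition.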

Better Solution

Load weights for each term once
Each term contributes O(df_t^2) partial scores
Allows efficiency tricks

Each term contributes only if it appears in both documents
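Written out, the decomposition behind this slide looks as follows. The notation is assumed here: w_{t,d} is the weight of term t in document d, and V is the vocabulary.

```latex
\[
\mathrm{sim}(d_i, d_j)
  = \sum_{t \in V} w_{t,d_i}\, w_{t,d_j}
  = \sum_{t \in d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
\]
```

A term that is absent from either document has weight zero there, so only shared terms add a partial score. A term with document frequency df_t occurs in df_t (df_t - 1) / 2 document pairs, which is where the O(df_t^2) count comes from.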

Decomposition MapReduce

Load weights for each term once
Each term contributes O(df_t^2) partial scores

Each term contributes only if it appears in both documents

[Diagram labels: map, index, reduce]

MapReduce Framework

(a) Map: (k1, v1) -> [(k2, v2)]
(b) Shuffle: group values by keys
(c) Reduce: (k2, [v2]) -> [(k3, v3)]

The framework handles low-level details transparently

[Figure: input flows through map tasks, shuffling, and reduce tasks to the output]
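To make the three stages concrete, here is a single-process stand-in for the framework. It is only a sketch: `run_mapreduce` and the word-count mapper and reducer are hypothetical names, and a real framework such as Hadoop distributes these steps across machines.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle, and reduce in memory on a list of (k1, v1) records."""
    # (a) Map: each (k1, v1) record yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: group all v2 values by their k2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2]) group yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def count_mapper(doc_id, text):
    for word in text.lower().split():
        yield word, 1

def count_reducer(word, counts):
    yield word, sum(counts)

docs = [(1, "Clinton Obama Clinton"), (2, "Clinton Cheney")]
result = run_mapreduce(docs, count_mapper, count_reducer)
print(result)
```

The user writes only the mapper and the reducer; grouping by key is the part the framework takes care of.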

Standard Indexing

(a) Map: tokenize
(b) Shuffle: group values by terms
(c) Reduce: combine into a posting list

Indexing (3-doc toy collection)

[Figure: three toy documents ("Clinton Obama Clinton", "Clinton Cheney", "Clinton Barack Obama") are tokenized and combined into posting lists; terms shown include Clinton, Barack, and Cheney.]
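The indexing steps can be sketched on the three toy documents. The document ids 1 to 3, the lowercasing, and the use of raw term frequency as the stored value are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

def index_map(doc_id, text):
    """(a) Map: tokenize one document and emit (term, (doc_id, tf))."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)

def index_reduce(term, postings):
    """(c) Reduce: combine the grouped values into one sorted posting list."""
    return term, sorted(postings)

def build_index(docs):
    grouped = defaultdict(list)  # (b) Shuffle: group values by term
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            grouped[term].append(posting)
    return dict(index_reduce(t, p) for t, p in grouped.items())

toy = {1: "Clinton Obama Clinton", 2: "Clinton Cheney", 3: "Clinton Barack Obama"}
index = build_index(toy)
for term in sorted(index):
    print(term, index[term])
```

Each posting list holds every document that contains the term, so its length is the term's document frequency.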

Pairwise Similarity

(a) Generate pairs
(b) Group pairs
(c) Sum pairs

[Figure: posting lists for terms such as Clinton, Barack, and Cheney are turned into document pairs, grouped, and summed.]

Pairwise Similarity (abstract)

(a) Generate pairs: map over term postings, multiply
(b) Group pairs: shuffle, group values by pairs
(c) Sum pairs: reduce, sum into similarity
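The generate, group, and sum steps can be sketched in a single process as follows. The index layout (term to a list of (doc_id, weight) postings), the function name, and the toy weights are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(index):
    """index maps term -> list of (doc_id, weight) postings."""
    # (a) Generate pairs: for every term, multiply the weights of each
    # pair of documents on its posting list.
    partial = []
    for postings in index.values():
        for (doc_a, w_a), (doc_b, w_b) in combinations(sorted(postings), 2):
            partial.append(((doc_a, doc_b), w_a * w_b))
    # (b) Group pairs: shuffle the partial scores by document pair.
    grouped = defaultdict(list)
    for pair, score in partial:
        grouped[pair].append(score)
    # (c) Sum pairs: add up the partial scores of each pair.
    return {pair: sum(scores) for pair, scores in grouped.items()}

# Toy postings with term counts as weights.
toy_index = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "obama": [(1, 1), (3, 1)],
    "barack": [(3, 1)],
    "cheney": [(2, 1)],
}
print(pairwise_similarity(toy_index))
```

Terms that occur in a single document generate no pairs at all, while a term on a long posting list generates a number of pairs quadratic in its length.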

Experimental Setup

Hadoop 0.16.0: open source MapReduce implementation
Cluster of 19 machines, each w/ two processors (single core)
Aquaint-2 collection: 906K documents
Okapi BM25
Subsets of collection

Elsayed, Lin, and Oard, ACL 2008

Efficiency (disk space)

[Figure: x-axis Corpus Size (%), 0 to 100; annotation: 8 trillion intermediate pairs]

Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 Collection, ~906k docs

Terms: Zipfian Distribution

[Figure: x-axis term rank]

Each term t contributes O(df_t^2) partial results
Very few terms dominate the computations
Most frequent term (“said”): 3%
Most frequent 10 terms: 15%
~0.1% of total terms (99.9% df-cut)
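A short sketch of why a handful of terms dominates: each term generates df (df - 1) / 2 document pairs, so the pair count is skewed far more than the document frequencies themselves. The Zipf-like frequencies below are synthetic and are not the Aquaint-2 statistics; the function names are hypothetical.

```python
def pair_counts(doc_freqs):
    """Number of document pairs each term generates: df * (df - 1) / 2."""
    return {t: df * (df - 1) // 2 for t, df in doc_freqs.items()}

def share_of_top_terms(doc_freqs, k):
    """Fraction of all generated pairs that comes from the k highest-df terms."""
    counts = sorted(pair_counts(doc_freqs).values(), reverse=True)
    total = sum(counts)
    return sum(counts[:k]) / total if total else 0.0

# Synthetic Zipf-like document frequencies: the r-th term has df ~ 100000 / r.
zipf_dfs = {f"term{r}": 100000 // r for r in range(1, 1001)}
print(share_of_top_terms(zipf_dfs, 10))
```

Dropping the few highest-df terms therefore removes a disproportionate share of the intermediate pairs.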

Efficiency (disk space)

[Figure: x-axis Corpus Size (%), 0 to 100; curves for no df-cut and for df-cut at 99.999%, 99.99%, 99.9%, and 99%; annotations: 8 trillion intermediate pairs, 0.5 trillion intermediate pairs]

Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 Collection, ~906k docs

Effectiveness (recent work)

Effect of df-cut on effectiveness
Medline04: 909k abstracts, ad-hoc retrieval

[Figure: x-axis df-cut (%), 99.00 to 100.00]

Drop 0.1% of terms
“Near-Linear” growth
Fit on disk
Cost 2% in effectiveness

Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

Implementation Issues

BM25s Similarity Model
TF, IDF
Document length

DF-Cut
Build a histogram
Pick the absolute df for the % df-cut
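One way to turn a percentage df-cut into an absolute df value is sketched below. It assumes, consistent with the "99.9% df-cut drops ~0.1% of terms" reading elsewhere in the talk, that an x% df-cut keeps the x% of terms with the lowest df. The function names and the sample frequencies are hypothetical.

```python
from collections import Counter

def df_cut_threshold(doc_freqs, cut_percent):
    """Largest df that survives a df-cut keeping `cut_percent` % of terms.

    Terms with df above the returned value are the most frequent ones
    and are dropped.
    """
    histogram = Counter(doc_freqs.values())      # df value -> number of terms
    keep = len(doc_freqs) * cut_percent / 100.0  # how many terms may stay
    kept, threshold = 0, 0
    for df in sorted(histogram):                 # walk from rare to frequent
        if kept + histogram[df] > keep:
            break
        kept += histogram[df]
        threshold = df
    return threshold

def apply_df_cut(doc_freqs, cut_percent):
    """Set of terms whose df does not exceed the df-cut threshold."""
    limit = df_cut_threshold(doc_freqs, cut_percent)
    return {t for t, df in doc_freqs.items() if df <= limit}

# Hypothetical document frequencies for ten terms.
dfs = {"said": 900, "clinton": 40, "obama": 30, "cheney": 5, "barack": 5,
       "a": 1, "b": 1, "c": 1, "d": 2, "e": 3}
print(df_cut_threshold(dfs, 90), sorted(apply_df_cut(dfs, 90)))
```

Terms that share the same df are kept or dropped together, so the fraction actually kept can fall slightly below the requested percentage.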

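[Equation (BM25s term weighting): a log (IDF) factor multiplied by TF factors with constants 0.5 and 1.5 and the document length ratio dl / avgdl]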
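The slide's equation did not survive extraction beyond its constants (0.5, 1.5), the log, and the ratio dl / avgdl. The sketch below uses one widely used Okapi-style weight that contains exactly those pieces; treating it as the form used in the talk is an assumption, and the function name and parameters are hypothetical.

```python
import math

def okapi_weight(tf, df, dl, avgdl, n_docs):
    """Okapi-style weight of a term in a document (assumed form, see above).

    tf: term frequency in the document, df: document frequency of the term,
    dl: document length, avgdl: average document length, n_docs: collection size.
    """
    tf_part = tf / (0.5 + 1.5 * dl / avgdl + tf)           # length-normalized tf
    idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))  # rarer terms weigh more
    return tf_part * idf_part

print(okapi_weight(tf=2, df=10, dl=120, avgdl=100, n_docs=1000))
```

With this form the idf factor turns negative for terms that occur in more than half of the collection, which is one more reason such terms are candidates for a df-cut.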

Other Approximation Techniques?

Other Approximation Techniques

(2) Absolute df

Consider only terms that appear in at least n (or %) documents
An absolute lower bound on df, instead of just removing the % most-frequent terms

(3) tf-Cut

Consider only documents (in posting list) with tf > T; T = 1 or 2

OR: consider only the top N documents based on tf for each term

(4) Similarity Threshold

Consider only partial scores > SimT
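Techniques (2) to (4) can all be applied while pairs are generated. The sketch below assumes postings of the form (doc_id, tf, weight) and uses hypothetical parameter names (`min_df`, `tf_cut`, `sim_threshold`) for n, T, and SimT.

```python
from itertools import combinations

def filtered_partial_scores(index, min_df=1, tf_cut=0, sim_threshold=0.0):
    """Yield ((doc_a, doc_b), partial_score) after three optional filters.

    index maps term -> list of (doc_id, tf, weight) postings.
    """
    for postings in index.values():
        if len(postings) < min_df:                     # (2) absolute df bound
            continue
        kept = [p for p in postings if p[1] > tf_cut]  # (3) tf-cut: tf > T
        for (a, _, w_a), (b, _, w_b) in combinations(sorted(kept), 2):
            score = w_a * w_b
            if score > sim_threshold:                  # (4) partial score > SimT
                yield (a, b), score

# Toy postings: (doc_id, tf, weight), with the weight equal to tf.
toy_postings = {
    "clinton": [(1, 2, 2.0), (2, 1, 1.0), (3, 1, 1.0)],
    "obama": [(1, 1, 1.0), (3, 1, 1.0)],
    "cheney": [(2, 1, 1.0)],
}
print(sorted(filtered_partial_scores(toy_postings, sim_threshold=1.0)))
```

Each filter discards partial scores before they reach the shuffle, so the savings show up directly in the volume of intermediate data.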

Other Approximation Techniques:

(5) Ranked List

Keep only the most similar N documents (in the reduce phase)

Good for ad-hoc retrieval and “more-like this” queries
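A ranked list per document can be sketched with a bounded selection over the summed scores. This assumes the scores are available per document (for example after grouping by doc id); the function name and the sample scores are hypothetical.

```python
import heapq
from collections import defaultdict

def top_n_similar(pair_scores, n):
    """For each document keep only its n most similar documents."""
    neighbours = defaultdict(list)
    for (a, b), score in pair_scores.items():
        neighbours[a].append((score, b))  # similarity is symmetric,
        neighbours[b].append((score, a))  # so record it for both sides
    return {doc: [(other, s) for s, other in heapq.nlargest(n, cands)]
            for doc, cands in neighbours.items()}

scores = {(1, 2): 2, (1, 3): 3, (2, 3): 1}
print(top_n_similar(scores, 1))
```

The output size is bounded by N entries per document instead of growing with the number of non-zero pairs.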

Space-Saving Tricks

(1) Stripes

Stripes instead of pairs
Group by doc-id not pairs
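A rough sketch of the stripes idea: instead of emitting one record per document pair, the map step emits one record per document holding a small map of partial scores, and the reduce step merges those maps by doc id. The index layout and names are assumptions for illustration.

```python
from collections import defaultdict

def stripes_similarity(index):
    """index maps term -> list of (doc_id, weight) postings."""
    # Map: for each term emit one stripe per document, keyed by doc id,
    # holding partial scores against the later documents on the list.
    stripes = []
    for postings in index.values():
        ordered = sorted(postings)
        for i, (doc_a, w_a) in enumerate(ordered):
            stripe = {doc_b: w_a * w_b for doc_b, w_b in ordered[i + 1:]}
            if stripe:
                stripes.append((doc_a, stripe))
    # Reduce: group stripes by doc id and merge them element-wise.
    merged = defaultdict(lambda: defaultdict(float))
    for doc_a, stripe in stripes:
        for doc_b, score in stripe.items():
            merged[doc_a][doc_b] += score
    return {doc: dict(row) for doc, row in merged.items()}

stripe_index = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "obama": [(1, 1), (3, 1)],
    "cheney": [(2, 1)],
}
print(stripes_similarity(stripe_index))
```

Fewer, larger records are shuffled, and the key of each record is a single doc id rather than a pair of ids.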

Space-Saving Tricks

(2) Blocking

No need to generate the whole matrix at once
Generate different blocks of the matrix at different steps
Limit the max space required for intermediate results

[Figure: Similarity Matrix]
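One possible reading of blocking is sketched below: the rows of the similarity matrix are split into blocks of doc ids, and each step generates only the pairs whose first document falls in the current block. How the talk actually partitions the matrix is not stated, so the row-block scheme, the function name, and the parameters are assumptions.

```python
from itertools import combinations

def similarity_by_blocks(index, doc_ids, block_size):
    """Yield one block of the similarity matrix at a time.

    Each step keeps only pairs whose smaller doc id is in the current block,
    so the intermediate results never cover the whole matrix.
    """
    ordered = sorted(doc_ids)
    for start in range(0, len(ordered), block_size):
        block = set(ordered[start:start + block_size])
        scores = {}
        for postings in index.values():
            for (a, w_a), (b, w_b) in combinations(sorted(postings), 2):
                if a in block:
                    scores[(a, b)] = scores.get((a, b), 0) + w_a * w_b
        yield scores

block_index = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "obama": [(1, 1), (3, 1)],
    "cheney": [(2, 1)],
}
for block_scores in similarity_by_blocks(block_index, [1, 2, 3], block_size=1):
    print(block_scores)
```

This sketch rescans the index once per block, trading extra passes for a bound on the space held at any one step.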
