On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems
Chunqiang Tang, Sandhya Dwarkadas, Zhichen Xu
University of Rochester; Yahoo! Inc.
ACM SIGIR 2004, Session: Dimensionality Reduction
Abstract (1/2)
Promising direction: combine IR with peer-to-peer technology for scalability, fault tolerance, and low administration cost.
pSearch: places docs onto a p2p overlay network according to semantic vectors produced using Latent Semantic Indexing (LSI).
Limitations (inherited from LSI): when the corpus is large, retrieval quality is poor, and the Singular Value Decomposition (SVD) in LSI is unscalable in terms of both memory and time.
Abstract (2/2)
Contributions:
To reduce the cost of SVD, we reduce the size of its input matrix through doc clustering and term selection.
Proper normalization of semantic vectors for terms and docs improves recall by 76%.
To further improve retrieval quality, we use low-dimensional subvectors of semantic vectors to cluster documents in the overlay, and then use Okapi to guide the search and doc selection.
Introduction (1/3)
Information grows exponentially, exceeding 10^18 bytes each year.
P2P systems: scalability, fault tolerance, and a self-organizing nature raise hope for building large-scale IR systems.
pSearch: populates docs in the network according to doc semantics derived from LSI; the search cost for a query is reduced to a small number of routing hops.
Introduction (2/3)
The limitations of pSearch: when the corpus is large, retrieval quality is poor; the SVD that LSI uses to derive semantic vectors of docs is not scalable in terms of memory consumption and computation time.
We propose techniques to address these limitations:
eLSI (efficient LSI): doc clustering and term selection.
Proper normalization of semantic vectors for terms and docs improves recall by 76%.
LSI+Okapi: use low-dimensional subvectors of semantic vectors to implicitly cluster docs, and then use Okapi to guide the search process and doc selection.
Introduction (3/3)
Contributions:
Deriving low-dimensional representations for high-dimensional data is a common theme in many fields, e.g., Principal Component Analysis (PCA) and LSI. The proper configuration we found for LSI should be of general interest to the LSI community.
Since nearest-neighbor search in a high-dimensional space is prohibitive, we propose pSearch.
pSearch System Overview (1/4)
An example of how the system works
pSearch uses a CAN to organize engine nodes and an extension of LSI, called pLSI, to answer queries.
Vector Space Model (VSM)
ltc term weighting:
$w_{ij} = \frac{[\log(f_{ij}) + 1] \cdot \log(D/d_i)}{\sqrt{\sum_x \left([\log(f_{xj}) + 1] \cdot \log(D/d_x)\right)^2}}$
where $f_{ij}$ is the frequency of term $i$ in doc $j$, $D$ is the total number of docs, and $d_i$ is the number of docs containing term $i$.
Cosine similarity: $\mathrm{sim}(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}$
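As a concrete illustration of the ltc scheme and cosine measure above, here is a minimal numpy sketch; the function names and the dense `tf` matrix are illustrative assumptions (a TREC-scale matrix would be sparse):

```python
import numpy as np

def ltc_weights(tf):
    """ltc weighting: log tf ('l'), idf ('t'), cosine normalization ('c').
    tf: t x d array of raw term frequencies (terms x docs)."""
    d = tf.shape[1]
    df = np.count_nonzero(tf, axis=1)            # d_i: docs containing term i
    idf = np.log(d / np.maximum(df, 1))          # log(D / d_i)
    ltf = np.zeros_like(tf, dtype=float)
    nz = tf > 0
    ltf[nz] = np.log(tf[nz]) + 1.0               # log(f_ij) + 1
    w = ltf * idf[:, None]
    norms = np.linalg.norm(w, axis=0)            # per-doc cosine normalization
    return w / np.maximum(norms, 1e-12)

def cosine_sim(a, b):
    """sim(a, b) = a . b / (|a| |b|)"""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```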
pSearch System Overview (2/4)
Latent Semantic Indexing
A: term-doc matrix with rank r. LSI approximates A with a rank-k matrix by omitting all but the k largest singular values.
Content-Addressable Network (CAN)
CAN partitions a d-dimensional Cartesian space into zones and assigns each zone to a node.
SVD: $A = U \Sigma V^T$; rank-k approximation: $A_k = U_k \Sigma_k V_k^T$.
Query projection: $\hat{q} = U_k^T q$ or $\hat{q} = \Sigma_k^{-1} U_k^T q$.
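A sketch of rank-k LSI and the two query-projection choices, using numpy's dense SVD for illustration only (a corpus the size of TREC requires a sparse Lanczos solver such as SVDPACK; the function names are assumptions):

```python
import numpy as np

def lsi(A, k):
    """Rank-k LSI: A ~ A_k = U_k S_k V_k^T, keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in(q, Uk, sk, scale=True):
    """Project a t-dim query q into the k-dim semantic space.
    scale=True: q_hat = S_k^{-1} U_k^T q; scale=False: q_hat = U_k^T q."""
    qhat = Uk.T @ q
    return qhat / sk if scale else qhat
```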
pSearch System Overview (3/4)
The pLSI algorithm: pLSI combines LSI and CAN to build pSearch. Upon reaching the destination, the query is flooded to nodes within a small radius r.
Content-directed search algorithm: each node samples the content stored on its neighbors and uses the samples to decide which node to search next.
LSI uses a k = 50~350 dimensional space for small corpora.
pSearch System Overview (4/4)
Dimension mismatch between CAN and LSI: the real dimensionality of a CAN can't be higher than l = O(log(n)), where n is the number of nodes.
pLSI partitions a k-dimensional semantic vector into multiple l-dimension subvectors. Given a doc, we store its index at p places in the CAN, using its first p subvectors as DHT keys (p = 4).
Similarity between subvectors approximates similarity between the full vectors.
accuracy = |A ∩ B| / |A|, where A is the set of 15 docs retrieved for each TREC 7&8 query based on the full 300-dimension semantic vectors.
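A sketch of the subvector partitioning and the accuracy metric, assuming the DHT keys are the first p non-overlapping l-dimension slices of the semantic vector:

```python
def subvectors(v, l, p):
    """Split a k-dim semantic vector into p subvectors of l dimensions each;
    a doc's first p subvectors serve as its p DHT keys (p = 4 in the paper)."""
    return [v[i * l:(i + 1) * l] for i in range(p)]

def accuracy(A, B):
    """accuracy = |A intersect B| / |A|: the fraction of the full-vector
    result set A that the subvector-based result set B recovers."""
    A, B = set(A), set(B)
    return len(A & B) / len(A)
```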
Improving Retrieval Quality (1/5)
Proper LSI configuration: term normalization; doc normalization; the choice of matrix used to project vectors (see below).
Experiment: SVDPACK. Corpus: disks 4 and 5 from TREC, 528,543 docs, 2GB. Queries: the title field of topics 351-450. ltc is used to generate the term-doc matrix for SVD. Due to memory limitations, only 15% of the TREC corpus is selected, yielding an 83,098-term by 79,316-doc matrix whose SVD projects vectors into a 300-dimension space (memory: 1.7GB; time: 57 minutes on a 2GHz Pentium 4).
Projection choices: $\hat{q} = U_k^T q$ or $\hat{q} = \Sigma_k^{-1} U_k^T q$.
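A sketch of the normalization steps, assuming the usual LSI convention that term vectors are rows of $U_k \Sigma_k$ and doc vectors are rows of $V_k \Sigma_k$ (a convention, not a detail stated on the slide):

```python
import numpy as np

def normalize_rows(M, eps=1e-12):
    """Scale each row (one term or doc semantic vector) to unit length,
    so that inner products between rows become cosine similarities."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.maximum(norms, eps)

def term_vectors(Uk, sk):
    """Normalized term semantic vectors: rows of U_k S_k."""
    return normalize_rows(Uk * sk)

def doc_vectors(Vt_k, sk):
    """Normalized doc semantic vectors: rows of V_k S_k."""
    return normalize_rows(Vt_k.T * sk)
```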
Improving Retrieval Quality (2/5)
Improvement: retrieve 1,000 docs for each query and report the average number of relevant docs.
Returns 76% more relevant docs when normalizing both terms and docs.
Normalizing terms improves performance by emphasizing infrequent, more discriminative terms.
Normalizing docs corroborates the belief that cosine is a robust measure of similarity.
Improving Retrieval Quality (3/5)
TREC vs. Medlars Corpus
Medlars: 1,033 docs and 30 queries; docs and queries are projected into a 50-dimension space.
50 dimensions are sufficient for the small corpus; 300 dimensions are insufficient for the large corpus.
Normalization is beneficial if the dimensionality of the semantic space is insufficient to capture the fine structure of the corpus.
Improving Retrieval Quality (4/5)
LSI is bad for a large corpus: LSI does not exploit doc length in ranking, and a 300-dimension semantic space is insufficient for TREC. LSI's performance can be improved by increasing dimensionality.
LSI+Okapi: use 4-plane pLSI (each plane 25 dimensions); each plane retrieves 1,000 docs, and Okapi ranks the returned 4,000 docs.
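A sketch of this pooling-and-reranking step; the BM25 scoring is the standard Okapi formula with conventional parameters (k1 = 1.2, b = 0.75), an assumption rather than the paper's exact configuration:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_len, df, N, k1=1.2, b=0.75):
    """Standard Okapi BM25. doc_tf: term -> frequency in this doc;
    df: term -> doc frequency; N: total number of docs."""
    score = 0.0
    for term in query_terms:
        f = doc_tf.get(term, 0)
        if f == 0 or term not in df:
            continue
        idf = math.log(1.0 + (N - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avg_len))
    return score

def lsi_plus_okapi(per_plane_candidates, score_fn):
    """Pool the candidates returned by each pLSI plane (e.g. 4 x 1,000 docs)
    and rerank the pooled set with Okapi."""
    pooled = set().union(*per_plane_candidates)
    return sorted(pooled, key=score_fn, reverse=True)
```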
Improving Retrieval Quality (5/5)
Figures: precision-recall for TREC, precision-recall for Medlars, and high-end precision for TREC (P@i: precision when retrieving i docs for a query).
The performance of LSI+Okapi: high-end precision approaches that of Okapi, but low-end precision still lags behind. Low-end precision can be improved by allowing each plane to return more candidate docs for Okapi to rank, but this would increase the search cost.
Improving the Efficiency of LSI (1/4)
Traditionally, LSI uses the term-doc matrix as the input to SVD: for a matrix $A \in \mathbb{R}^{t \times d}$ with about c nonzero elements per column, the time complexity of SVD is O(t·d·c).
The eLSI algorithm:
Use spherical k-means to cluster docs into s clusters, with centroids $C = [c_1\ c_2\ \cdots\ c_s] \in \mathbb{R}^{t \times s}$.
Select the e rows of C whose terms have the largest aggregate weight to construct a row-reduced matrix:
Aggregate weight of term $i$: $w_i = \sum_{j=1}^{s} c_{ij}$; row-reduced matrix $C^* \in \mathbb{R}^{e \times s}$.
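A sketch of eLSI's input reduction under these definitions: spherical k-means over unit-length doc vectors, followed by term selection; the initialization and iteration count are assumptions:

```python
import numpy as np

def spherical_kmeans(X, s, iters=20, seed=0):
    """Cluster the unit-length doc vectors (columns of the t x d matrix X)
    into s clusters by cosine similarity; returns the t x s centroid matrix C."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    C = X[:, rng.choice(d, size=s, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(C.T @ X, axis=0)      # nearest centroid by cosine
        for j in range(s):
            members = X[:, assign == j]
            if members.shape[1] > 0:             # keep old centroid if empty
                c = members.sum(axis=1)
                C[:, j] = c / max(np.linalg.norm(c), 1e-12)
    return C

def select_terms(C, e):
    """Term selection: keep the e rows (terms) of C with the largest aggregate
    weight w_i = sum_j c_ij, yielding the row-reduced matrix C* in R^{e x s}."""
    w = C.sum(axis=1)
    top = np.argsort(w)[-e:]
    return C[top, :], top
```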
Improving the Efficiency of LSI (2/4)
For the TREC corpus: the complete term-doc matrix has 408,653 rows and 528,155 columns, while the reduced matrix $C^*$ has fewer than 2,000 rows and 2,000 columns.
Projection: project terms into the semantic space using $V_k^*$; project a doc (or query) vector q into the semantic space and normalize it to unit length:
$C^* = U^* \Sigma^* V^{*T}$, approximated by $C_k^* = U_k^* \Sigma_k^* V_k^{*T}$
$B = C V_k^* \Sigma_k^{*-1} \in \mathbb{R}^{t \times k}$
$\hat{q} = B^T q$, then $\hat{q} \leftarrow \hat{q} / \|\hat{q}\|_2$
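A sketch of this projection pipeline; the $\Sigma_k^{*-1}$ scaling in B follows standard folding-in and should be read as an assumption:

```python
import numpy as np

def elsi_basis(C, C_star, k):
    """SVD the small e x s matrix C*, then fold the full t x s cluster matrix C
    in to get term semantic vectors B = C V*_k S*_k^{-1} (a t x k matrix)."""
    U, s_vals, Vt = np.linalg.svd(C_star, full_matrices=False)
    Vk, sk = Vt[:k, :].T, s_vals[:k]
    return (C @ Vk) / sk

def project_query(B, q, eps=1e-12):
    """Project a doc/query term vector: q_hat = B^T q, normalized to unit length."""
    qhat = B.T @ q
    return qhat / max(np.linalg.norm(qhat), eps)
```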
Improving the Efficiency of LSI (3/4)
Other Dimensionality Reduction Methods
Random Projection (RP): project with a random matrix.
The first step of all the other algorithms partitions docs into k clusters with centroids $G = [g_1\ g_2\ \cdots\ g_k] \in \mathbb{R}^{t \times k}$.
Concept Indexing (CI): project onto the cluster centroids.
The third algorithm solves a least-squares problem via QR decomposition:
RP: $\hat{q} = P^T q$, where $P \in \mathbb{R}^{t \times k}$ is a random matrix.
CI: $\hat{q} = G^T q$.
Least squares: $\hat{q} = \arg\min_{\hat{q}} \|G\hat{q} - q\|_2$, solved via the QR decomposition $G = QR = [Q_k\ Q_r] \begin{bmatrix} R_k \\ 0 \end{bmatrix}$, giving $\hat{q} = R_k^{-1} Q_k^T q$.
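A sketch of the three alternatives, using numpy's thin QR for the least-squares variant:

```python
import numpy as np

def rp(q, P):
    """Random Projection: q_hat = P^T q, with P a t x k random matrix."""
    return P.T @ q

def ci(q, G):
    """Concept Indexing: q_hat = G^T q, with G the t x k centroid matrix."""
    return G.T @ q

def least_squares_qr(q, G):
    """Solve q_hat = argmin ||G q_hat - q||_2 via the thin QR of G:
    G = Q_k R_k  =>  q_hat = R_k^{-1} Q_k^T q."""
    Qk, Rk = np.linalg.qr(G)        # reduced QR: Qk is t x k, Rk is k x k
    return np.linalg.solve(Rk, Qk.T @ q)
```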
Improving the Efficiency of LSI (4/4)
RP-eLSI: replace term selection with a random projection, $C^* = F^T C \in \mathbb{R}^{e \times s}$, where $F \in \mathbb{R}^{t \times e}$ is a random matrix.
Comparing dimensionality reduction methods: RP performs well when the dimensionality of the reduced space is sufficient to capture the real dimensionality of the data.
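A sketch of RP-eLSI under the formula above; the 1/sqrt(e) scaling of the random matrix is a common convention, assumed here:

```python
import numpy as np

def rp_elsi(C, e, seed=0):
    """RP-eLSI: replace term selection with a random projection of the
    cluster matrix: C* = F^T C, where F is a t x e random matrix."""
    rng = np.random.default_rng(seed)
    t = C.shape[0]
    F = rng.standard_normal((t, e)) / np.sqrt(e)
    return F.T @ C                  # C* in R^{e x s}
```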