On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems
Chunqiang Tang, Sandhya Dwarkadas, Zhichen Xu
University of Rochester; Yahoo! Inc.
ACM SIGIR 2004, Session: Dimensionality Reduction
Abstract (1/2)
Promising direction: combine IR with peer-to-peer technology for scalability, fault tolerance, and low administration cost.
pSearch: places docs onto a p2p overlay network according to semantic vectors produced using Latent Semantic Indexing (LSI).
Limitations (inherited from LSI): when the corpus is large, retrieval quality is poor, and the Singular Value Decomposition (SVD) in LSI is unscalable in terms of both memory and time.
Abstract (2/2)
Contributions:
To reduce the cost of SVD, we reduce the size of its input matrix through doc clustering and term selection.
Proper normalization of semantic vectors for terms and docs improves recall by 76%.
To further improve retrieval quality, we use low-dimensional subvectors of semantic vectors to cluster documents in the overlay, and then use Okapi to guide the search and doc selection.
Introduction (1/3)
Information grows exponentially, exceeding 10^18 bytes each year.
P2P systems: scalability, fault tolerance, and a self-organizing nature raise hope for building large-scale IR systems.
pSearch: populates docs in the network according to doc semantics derived from LSI; the search cost for a query is reduced to a small number of routing hops.
Introduction (2/3)
The limitations of pSearch: when the corpus is large, retrieval quality is poor; the SVD that LSI uses to derive semantic vectors of docs is not scalable in terms of memory consumption and computation time.
We propose techniques to address these limitations:
eLSI (efficient LSI): doc clustering and term selection.
Proper normalization of semantic vectors for terms and docs improves recall by 76%.
LSI+Okapi: use low-dimensional subvectors of semantic vectors to implicitly cluster docs, and then use Okapi to guide the search process and doc selection.
Introduction (3/3)
Contributions:
Deriving low-dimensional representations for high-dimensional data is a common theme in many fields, e.g., Principal Component Analysis (PCA) and LSI. The proper configuration we found for LSI should be of general interest to the LSI community.
Since nearest-neighbor search in a high-dimensional space is prohibitive, we propose pSearch.
pSearch System Overview (1/4)
An example of how the system works
pSearch uses a CAN to organize engine nodes and an extension of LSI, called pLSI, to answer queries.
Vector Space Model (VSM)
ltc term weighting:
$w_{ij} = \frac{[\log(f_{ij}) + 1] \cdot \log(D/d_i)}{\sqrt{\sum_x \left([\log(f_{xj}) + 1] \cdot \log(D/d_x)\right)^2}}$
where $f_{ij}$ is the frequency of term $i$ in doc $j$, $D$ is the total number of docs, and $d_i$ is the number of docs containing term $i$.
Cosine similarity: $\mathrm{sim}(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}$
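As a concrete illustration of the ltc scheme and cosine measure above, here is a minimal numpy sketch; the function names and the dense `tf` matrix are illustrative assumptions (a TREC-scale matrix would be sparse):

```python
import numpy as np

def ltc_weights(tf):
    """ltc weighting: log tf ('l'), idf ('t'), cosine normalization ('c').
    tf: t x d array of raw term frequencies (terms x docs)."""
    d = tf.shape[1]
    df = np.count_nonzero(tf, axis=1)            # d_i: docs containing term i
    idf = np.log(d / np.maximum(df, 1))          # log(D / d_i)
    ltf = np.zeros_like(tf, dtype=float)
    nz = tf > 0
    ltf[nz] = np.log(tf[nz]) + 1.0               # log(f_ij) + 1
    w = ltf * idf[:, None]
    norms = np.linalg.norm(w, axis=0)            # per-doc cosine normalization
    return w / np.maximum(norms, 1e-12)

def cosine_sim(a, b):
    """sim(a, b) = a . b / (|a| |b|)"""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```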
pSearch System Overview (2/4)
Latent Semantic Indexing
A: term-doc matrix with rank r. LSI approximates A with a rank-k matrix by omitting all but the k largest singular values.
Content-Addressable Network (CAN)
CAN partitions a d-dimensional Cartesian space into zones and assigns each zone to a node.
SVD: $A = U \Sigma V^T$; rank-k approximation: $A_k = U_k \Sigma_k V_k^T$.
Query projection: $\hat{q} = U_k^T q$ or $\hat{q} = \Sigma_k^{-1} U_k^T q$.
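A sketch of rank-k LSI and the two query-projection choices, using numpy's dense SVD for illustration only (a corpus the size of TREC requires a sparse Lanczos solver such as SVDPACK; the function names are assumptions):

```python
import numpy as np

def lsi(A, k):
    """Rank-k LSI: A ~ A_k = U_k S_k V_k^T, keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in(q, Uk, sk, scale=True):
    """Project a t-dim query q into the k-dim semantic space.
    scale=True: q_hat = S_k^{-1} U_k^T q; scale=False: q_hat = U_k^T q."""
    qhat = Uk.T @ q
    return qhat / sk if scale else qhat
```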
pSearch System Overview (3/4)
The pLSI algorithm: pLSI combines LSI and CAN to build pSearch. Upon reaching the destination, the query is flooded to nodes within a small radius r.
Content-directed search algorithm: each node samples the content stored on its neighbors and uses the samples to decide which node to search next.
LSI uses a k = 50~350 dimensional space for small corpora.
pSearch System Overview (4/4)
Dimension mismatch between CAN and LSI: the real dimensionality of a CAN can't be higher than l = O(log(n)), where n is the number of nodes.
pLSI partitions a k-dimensional semantic vector into multiple l-dimension subvectors. Given a doc, we store its index at p places in the CAN, using its first p subvectors as DHT keys (p = 4).
Similarity between subvectors approximates similarity between the full vectors.
accuracy = |A ∩ B| / |A|, where A is the set of 15 docs retrieved for each TREC 7&8 query based on the full 300-dimension semantic vectors.
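A sketch of the subvector partitioning and the accuracy metric, assuming the DHT keys are the first p non-overlapping l-dimension slices of the semantic vector:

```python
def subvectors(v, l, p):
    """Split a k-dim semantic vector into p subvectors of l dimensions each;
    a doc's first p subvectors serve as its p DHT keys (p = 4 in the paper)."""
    return [v[i * l:(i + 1) * l] for i in range(p)]

def accuracy(A, B):
    """accuracy = |A intersect B| / |A|: the fraction of the full-vector
    result set A that the subvector-based result set B recovers."""
    A, B = set(A), set(B)
    return len(A & B) / len(A)
```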
Improving Retrieval Quality (1/5)
Proper LSI configuration: term normalization; doc normalization; the choice of matrix used to project vectors (see below).
Experiment: SVDPACK. Corpus: disks 4 and 5 from TREC, 528,543 docs, 2GB. Queries: the title field of topics 351-450. ltc is used to generate the term-doc matrix for SVD. Due to memory limitations, only 15% of the TREC corpus is selected, yielding an 83,098-term by 79,316-doc matrix whose SVD projects vectors into a 300-dimension space (memory: 1.7GB; time: 57 minutes on a 2GHz Pentium 4).
Projection choices: $\hat{q} = U_k^T q$ or $\hat{q} = \Sigma_k^{-1} U_k^T q$.
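A sketch of the normalization steps, assuming the usual LSI convention that term vectors are rows of $U_k \Sigma_k$ and doc vectors are rows of $V_k \Sigma_k$ (a convention, not a detail stated on the slide):

```python
import numpy as np

def normalize_rows(M, eps=1e-12):
    """Scale each row (one term or doc semantic vector) to unit length,
    so that inner products between rows become cosine similarities."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.maximum(norms, eps)

def term_vectors(Uk, sk):
    """Normalized term semantic vectors: rows of U_k S_k."""
    return normalize_rows(Uk * sk)

def doc_vectors(Vt_k, sk):
    """Normalized doc semantic vectors: rows of V_k S_k."""
    return normalize_rows(Vt_k.T * sk)
```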
Improving Retrieval Quality (2/5)
Improvement: retrieve 1,000 docs for each query and report the average number of relevant docs.
Returns 76% more relevant docs when normalizing both terms and docs.
Normalizing terms improves performance by emphasizing infrequent, more discriminative terms.
Normalizing docs corroborates the belief that cosine is a robust measure of similarity.
Improving Retrieval Quality (3/5)
TREC vs. Medlars Corpus
Medlars: 1,033 docs and 30 queries; docs and queries are projected into a 50-dimension space.
50 dimensions are sufficient for the small corpus; 300 dimensions are insufficient for the large corpus.
Normalization is beneficial if the dimensionality of the semantic space is insufficient to capture the fine structure of the corpus.
Improving Retrieval Quality (4/5)
LSI is bad for a large corpus: LSI does not exploit doc length in ranking, and a 300-dimension semantic space is insufficient for TREC. LSI's performance can be improved by increasing dimensionality.
LSI+Okapi: use 4-plane pLSI (each plane 25 dimensions); each plane retrieves 1,000 docs, and Okapi ranks the returned 4,000 docs.
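A sketch of this pooling-and-reranking step; the BM25 scoring is the standard Okapi formula with conventional parameters (k1 = 1.2, b = 0.75), an assumption rather than the paper's exact configuration:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_len, df, N, k1=1.2, b=0.75):
    """Standard Okapi BM25. doc_tf: term -> frequency in this doc;
    df: term -> doc frequency; N: total number of docs."""
    score = 0.0
    for term in query_terms:
        f = doc_tf.get(term, 0)
        if f == 0 or term not in df:
            continue
        idf = math.log(1.0 + (N - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avg_len))
    return score

def lsi_plus_okapi(per_plane_candidates, score_fn):
    """Pool the candidates returned by each pLSI plane (e.g. 4 x 1,000 docs)
    and rerank the pooled set with Okapi."""
    pooled = set().union(*per_plane_candidates)
    return sorted(pooled, key=score_fn, reverse=True)
```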
Improving Retrieval Quality (5/5)
Figures: precision-recall for TREC, precision-recall for Medlars, and high-end precision for TREC (P@i: precision when retrieving i docs for a query).
The performance of LSI+Okapi: high-end precision approaches that of Okapi, but low-end precision still lags behind. Low-end precision can be improved by allowing each plane to return more candidate docs for Okapi to rank, but this would increase the search cost.
Improving the Efficiency of LSI (1/4)
Traditionally, LSI uses the term-doc matrix as the input to SVD: for a matrix $A \in \mathbb{R}^{t \times d}$ with about c nonzero elements per column, the time complexity of SVD is O(t·d·c).
The eLSI algorithm:
Use spherical k-means to cluster docs into s clusters, with centroids $C = [c_1\ c_2\ \cdots\ c_s] \in \mathbb{R}^{t \times s}$.
Select the e rows of C whose terms have the largest aggregate weight to construct a row-reduced matrix:
Aggregate weight of term $i$: $w_i = \sum_{j=1}^{s} c_{ij}$; row-reduced matrix $C^* \in \mathbb{R}^{e \times s}$.
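A sketch of eLSI's input reduction under these definitions: spherical k-means over unit-length doc vectors, followed by term selection; the initialization and iteration count are assumptions:

```python
import numpy as np

def spherical_kmeans(X, s, iters=20, seed=0):
    """Cluster the unit-length doc vectors (columns of the t x d matrix X)
    into s clusters by cosine similarity; returns the t x s centroid matrix C."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    C = X[:, rng.choice(d, size=s, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(C.T @ X, axis=0)      # nearest centroid by cosine
        for j in range(s):
            members = X[:, assign == j]
            if members.shape[1] > 0:             # keep old centroid if empty
                c = members.sum(axis=1)
                C[:, j] = c / max(np.linalg.norm(c), 1e-12)
    return C

def select_terms(C, e):
    """Term selection: keep the e rows (terms) of C with the largest aggregate
    weight w_i = sum_j c_ij, yielding the row-reduced matrix C* in R^{e x s}."""
    w = C.sum(axis=1)
    top = np.argsort(w)[-e:]
    return C[top, :], top
```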
Improving the Efficiency of LSI (2/4)
For the TREC corpus: the complete term-doc matrix has 408,653 rows and 528,155 columns, while the reduced matrix $C^*$ has fewer than 2,000 rows and 2,000 columns.
Projection: project terms into the semantic space using $V_k^*$; project a doc (or query) vector q into the semantic space and normalize it to unit length:
$C^* = U^* \Sigma^* V^{*T}$, approximated by $C_k^* = U_k^* \Sigma_k^* V_k^{*T}$
$B = C V_k^* \Sigma_k^{*-1} \in \mathbb{R}^{t \times k}$
$\hat{q} = B^T q$, then $\hat{q} \leftarrow \hat{q} / \|\hat{q}\|_2$
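A sketch of this projection pipeline; the $\Sigma_k^{*-1}$ scaling in B follows standard folding-in and should be read as an assumption:

```python
import numpy as np

def elsi_basis(C, C_star, k):
    """SVD the small e x s matrix C*, then fold the full t x s cluster matrix C
    in to get term semantic vectors B = C V*_k S*_k^{-1} (a t x k matrix)."""
    U, s_vals, Vt = np.linalg.svd(C_star, full_matrices=False)
    Vk, sk = Vt[:k, :].T, s_vals[:k]
    return (C @ Vk) / sk

def project_query(B, q, eps=1e-12):
    """Project a doc/query term vector: q_hat = B^T q, normalized to unit length."""
    qhat = B.T @ q
    return qhat / max(np.linalg.norm(qhat), eps)
```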
Improving the Efficiency of LSI (3/4)
Other Dimensionality Reduction Methods
Random Projection (RP): project with a random matrix.
The first step of all the other algorithms partitions docs into k clusters with centroids $G = [g_1\ g_2\ \cdots\ g_k] \in \mathbb{R}^{t \times k}$.
Concept Indexing (CI): project onto the cluster centroids.
The third algorithm solves a least-squares problem via QR decomposition:
RP: $\hat{q} = P^T q$, where $P \in \mathbb{R}^{t \times k}$ is a random matrix.
CI: $\hat{q} = G^T q$.
Least squares: $\hat{q} = \arg\min_{\hat{q}} \|G\hat{q} - q\|_2$, solved via the QR decomposition $G = QR = [Q_k\ Q_r] \begin{bmatrix} R_k \\ 0 \end{bmatrix}$, giving $\hat{q} = R_k^{-1} Q_k^T q$.
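A sketch of the three alternatives, using numpy's thin QR for the least-squares variant:

```python
import numpy as np

def rp(q, P):
    """Random Projection: q_hat = P^T q, with P a t x k random matrix."""
    return P.T @ q

def ci(q, G):
    """Concept Indexing: q_hat = G^T q, with G the t x k centroid matrix."""
    return G.T @ q

def least_squares_qr(q, G):
    """Solve q_hat = argmin ||G q_hat - q||_2 via the thin QR of G:
    G = Q_k R_k  =>  q_hat = R_k^{-1} Q_k^T q."""
    Qk, Rk = np.linalg.qr(G)        # reduced QR: Qk is t x k, Rk is k x k
    return np.linalg.solve(Rk, Qk.T @ q)
```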
Improving the Efficiency of LSI (4/4)
RP-eLSI: replace term selection with a random projection, $C^* = F^T C \in \mathbb{R}^{e \times s}$, where $F \in \mathbb{R}^{t \times e}$ is a random matrix.
Comparing dimensionality reduction methods: RP performs well when the dimensionality of the reduced space is sufficient to capture the real dimensionality of the data.
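A sketch of RP-eLSI under the formula above; the 1/sqrt(e) scaling of the random matrix is a common convention, assumed here:

```python
import numpy as np

def rp_elsi(C, e, seed=0):
    """RP-eLSI: replace term selection with a random projection of the
    cluster matrix: C* = F^T C, where F is a t x e random matrix."""
    rng = np.random.default_rng(seed)
    t = C.shape[0]
    F = rng.standard_normal((t, e)) / np.sqrt(e)
    return F.T @ C                  # C* in R^{e x s}
```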