Page 1

Variable Latent Semantic Indexing

Prabhakar Raghavan

Yahoo! Research, Sunnyvale, CA

November 2005

Joint work with A. Dasgupta, R. Kumar, and A. Tomkins (Yahoo! Research).

Page 2

Outline

1 Introduction

2 Background

3 Variable Latent Semantic Indexing

4 Experiments

Page 3

Outline

1 Introduction

2 Background

3 Variable Latent Semantic Indexing

4 Experiments

Page 4

Searching Text Corpora

Word      Count
Apple     10
...
Drivers   12
Oranges   0
...
Tiger     20
Widget    5

Page 5

Term-Document Matrices

[Figure: two documents and a query drawn as vectors in term space, with term axes t1, t2, t3, t4]

Each term is a dimension.
Each document is a vector over terms.
A query is a vector over terms.

Weighting schemes: Boolean, Okapi, TF-IDF, etc. (see the sketch below).

Document “similarity” ≈ closeness in term-space.
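The slides list the weighting schemes without detail; as one concrete example, here is a minimal TF-IDF weighting of a toy corpus. The corpus, tokenization, and the exact IDF formula are illustrative assumptions, not from the slides.

```python
import math
from collections import Counter

# Toy corpus: each document is a bag of (already tokenized) terms.
docs = [["apple", "tiger", "tiger"],
        ["drivers", "widget"],
        ["apple", "drivers", "tiger"]]

vocab = sorted({t for d in docs for t in d})
# Document frequency: in how many documents each term appears.
df = Counter(t for d in docs for t in set(d))

def tfidf_vector(doc):
    """One column of the term-document matrix under TF-IDF weighting."""
    tf = Counter(doc)
    return [tf[t] * math.log(len(docs) / df[t]) for t in vocab]

A_columns = [tfidf_vector(d) for d in docs]  # one weighted vector per document
```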

Page 6

Document Similarity

[Figure: a boolean term-document matrix A, with columns document1, document2, ..., documentn, and a 0/1 query vector q]

Term-document matrix A, query vector q.

Document relevance to the query is given by the (weighted) number of terms in common.

Relevance scores are given by q^T A.
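A small worked example of scoring by q^T A; the matrix and query are invented for illustration.

```python
import numpy as np

# Boolean term-document matrix A: rows are terms, columns are documents.
# (A 4-term, 3-document example made up for this sketch.)
A = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])

q = np.array([1, 0, 1, 0])    # query asking for terms 0 and 2

scores = q @ A                # q^T A: one relevance score per document
ranking = np.argsort(-scores) # documents ordered by descending relevance
print(scores)                 # [2 1 1]: document 0 shares both query terms
```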

Page 7

Outline

1 Introduction

2 Background

3 Variable Latent Semantic Indexing

4 Experiments

Page 8

LSI at a high level

Term-document matrix A. Want a representation Ã such that:

Ã preserves semantic associations and uses fewer resources.

Goal: measuring query-document similarity using Ã is efficient and gives better results.

The basic intuition of SVD/LSI is also used in clustering and collaborative filtering.

Page 9

Singular Value Decomposition

Singular value decomposition of a 3 × 3 matrix:

A = U × Σ × V^T, with Σ = diag(σ1, σ2, σ3).
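For instance, with NumPy (the 3 × 3 matrix is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [1.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(A)   # s holds sigma_1 >= sigma_2 >= sigma_3
assert np.allclose(A, U @ np.diag(s) @ Vt)   # A = U Sigma V^T
```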

Page 10

Latent Semantic Indexing

[Figure: SVD of the term-document matrix, A = U Σ V^T, with rows indexed by terms and columns by documents]

Suppose there are only k topics in the data.
Keep the top k singular values and vectors; denote the result A(k).

A(k) is the “closest” rank-k matrix to A.

“Closest” ≡ in the Frobenius and L2 norms.
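By the Eckart-Young theorem, the truncated SVD gives exactly this closest rank-k matrix in the Frobenius and spectral (L2) norms. A minimal NumPy sketch:

```python
import numpy as np

def rank_k_approximation(A, k):
    """A(k): keep only the top-k singular values and vectors of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```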

Page 12

LSI in Answering Queries

Propose A(k) as the representation Ã.

Space reduced by a factor of k / avg(#terms).

“Dimensionality” of corpus ≡ number of topics in corpus.

Results in a “cleaner” representation of structure by ignoring “irrelevant” axes.

Believed to identify synonymous terms, e.g., car and automobile.

Disambiguates based on context.

Page 13

Latent Semantic Indexing

[Figure: documents of A and their projections under Ã in term space, with axes t1, t2, t3, t4]

Finds the best rank-k subspace “fitting” the documents.

Ã = A(k).

Page 14

Motivating Variable LSI

[Figure: the same rank-k fit, now with a query vector pointing away from the fitted subspace]

Finds the best rank-k subspace “fitting” the documents.

Ã = A(k).

But weren't we dealing with answering queries?

Page 15

Outline

1 Introduction

2 Background

3 Variable Latent Semantic Indexing

4 Experiments

Page 16

Query Distribution

Query vectors have a skewed distribution over terms.
In a given corpus, we might see queries for only a small subset of terms.

E.g., queries only about sports and politics ...

Is there co-occurrence between query terms? E.g., “data” + “mining”.

Ad-hoc solution: delete irrelevant terms.
Is there a principled approach?

Page 17

Variable Latent Semantic Indexing

Probability distribution Q over the set of terms.
Query vector q chosen according to Q.
Want Ã such that for most such vectors q,

q^T Ã ≈ q^T A.

First cut: minimize the expectation of ‖q^T (A − Ã)‖.
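One way to evaluate this objective for a candidate Ã is a simple Monte Carlo estimate. The query sampler below is a hypothetical stand-in for Q, not something defined on the slides:

```python
import numpy as np

def expected_query_error(A, A_tilde, sample_query, n_samples=1000):
    """Monte Carlo estimate of E_{q ~ Q} ||q^T (A - A_tilde)||.

    sample_query: hypothetical callable returning one query vector
    drawn from Q (over the term dimensions of A).
    """
    D = A - A_tilde
    errors = [np.linalg.norm(sample_query() @ D) for _ in range(n_samples)]
    return np.mean(errors)
```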

Page 19

Isotropic Query Distribution

[Figure: documents and a query of uniformly random direction in term space, with axes t1, t2, t3, t4]

The query distribution has a uniformly random direction.
The expected ‖q^T (A − Ã)‖ is minimized at

Ã = A(k).
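A short justification (my reconstruction, not spelled out on the slides), using the squared norm for convenience:

```latex
\mathbb{E}_{q}\bigl\|q^T(A-\tilde A)\bigr\|^2
  = \mathbb{E}_{q}\,\mathrm{tr}\!\bigl[(A-\tilde A)^T q q^T (A-\tilde A)\bigr]
  = \mathrm{tr}\!\bigl[(A-\tilde A)^T C\,(A-\tilde A)\bigr],
  \qquad C = \mathbb{E}_{q}\bigl[q q^T\bigr].
```

For an isotropic Q we have C = σ²I, so the objective reduces to σ²‖A − Ã‖²_F, and by Eckart-Young the rank-k minimizer is Ã = A(k).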

Page 20

Skewed Query Distribution

[Figure: documents and a skewed random query vector in term space, with axes t1, t2, t3, t4]

Need to skew the rank-k approximation to match the query distribution.

Page 21

Variable Latent Semantic Indexing

Recall: A is the term-document matrix, Q the query distribution.

Co-occurrence matrix: C = E_{q∼Q}[q q^T].

Let X = C^{1/2} A.

Find the rank-k approximation X(k) of X.

Return Ã = C^{−1/2} X(k).
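A minimal NumPy sketch of this procedure. Computing C^{1/2} via an eigendecomposition and guarding against a singular C with a pseudo-inverse are implementation choices of this sketch, not prescribed by the slides:

```python
import numpy as np

def vlsi(A, C, k, eps=1e-12):
    """Return A_tilde = C^{-1/2} X(k), where X = C^{1/2} A.

    A: term-document matrix (terms x docs).
    C: query co-occurrence matrix E[q q^T] (terms x terms), symmetric PSD.
    """
    # Symmetric square root of C via eigendecomposition.
    w, V = np.linalg.eigh(C)
    w = np.clip(w, 0.0, None)                 # clip tiny negative eigenvalues
    C_half = V @ np.diag(np.sqrt(w)) @ V.T
    # Pseudo-inverse square root, in case C is singular.
    inv_sqrt_w = np.where(w > eps, 1.0 / np.sqrt(np.where(w > eps, w, 1.0)), 0.0)
    C_inv_half = V @ np.diag(inv_sqrt_w) @ V.T

    X = C_half @ A                            # move to the transformed term space
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation X(k)
    return C_inv_half @ X_k                   # map back to the original space
```

When C is the identity (the isotropic case), this reduces to plain LSI: Ã = A(k).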

Page 22

Proof Intuition

[Figure: the original term-document space]

Page 23

Proof Intuition

[Figure: the original term-document space is mapped to a transformed term space via X = C^{1/2} A]

Page 24

Proof Intuition

[Figure: within the transformed term space, X is projected onto the low-rank space X(k)]

Page 25

Proof Intuition

[Figure: the low-rank approximation is mapped back to the original space as C^{−1/2} X(k)]

Page 26

Outline

1 Introduction

2 Background

3 Variable Latent Semantic Indexing

4 Experiments

Page 27

Experimental Setup

Reuters data (1987): 21k documents, five categories, 112k terms, 134 terms per document.

Preprocessing: Porter-stemmed, case-folded, and stop-worded (sketched below); term-document matrices with boolean and Okapi weighting.

Used SVDPACKC.
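A sketch of this preprocessing pipeline as it might look with NLTK; the slides name the steps but not the tools, so the library choice is an assumption:

```python
import re
from nltk.corpus import stopwords      # needs nltk.download("stopwords") once
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

def preprocess(text):
    """Case-fold, remove stop words, and Porter-stem one raw document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stops]
```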

Page 28

Experimental Setup: Query Distribution

Single-word queries: terms distributed according to their frequency in the corpus; a power law, ordered by distribution in the corpus; a power law on a random ordering.

Two topics: money, commodities.

Double-word queries: a power law on ranked bigrams.
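A sketch of how such a power-law query sampler might be set up; the exponent and seed are arbitrary illustrative choices, not values from the experiments:

```python
import numpy as np

def power_law_sampler(terms, alpha=1.0, seed=0):
    """Sample single-word queries with P(term at rank r) proportional to r^-alpha.

    Pass terms ordered by corpus frequency for the "ordered" variant,
    or shuffle them first for the "random ordering" variant.
    """
    rng = np.random.default_rng(seed)
    p = np.arange(1, len(terms) + 1, dtype=float) ** -alpha
    p /= p.sum()
    return lambda: rng.choice(terms, p=p)
```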

Page 29

VLSI Results: L2 error

Typically saves an order of magnitude in the number of dimensions for the same L2 error.

Page 30

Results: Competitive Error

Competitive error = 1 − competitive precision.

Substantial improvements for competitive error too.
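The slides do not spell out the definition of competitive precision; assuming it is the fraction of the exact top-ℓ documents under q^T A that the approximation also ranks in its top ℓ, a sketch:

```python
import numpy as np

def competitive_error(q, A, A_tilde, ell=10):
    """1 - competitive precision for one query (definition assumed above,
    not taken from the slides)."""
    exact = set(np.argsort(-(q @ A))[:ell])
    approx = set(np.argsort(-(q @ A_tilde))[:ell])
    return 1.0 - len(exact & approx) / ell
```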

Page 31

Summary

LSI does effective dimension reduction, but VLSI can fine-tune it to the query distribution.
Space requirements are the same as for LSI.
The co-occurrence matrix has to be estimated.
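The natural empirical estimator for C is the second moment of logged queries. A sketch, assuming each logged query arrives as a list of term indices with unit weights (the log format is an assumption):

```python
import numpy as np

def estimate_cooccurrence(query_log, num_terms):
    """Empirical estimate of C = E[q q^T] from a query log.

    query_log: iterable of queries, each a list of term indices
    (unit term weights assumed for simplicity).
    """
    C = np.zeros((num_terms, num_terms))
    count = 0
    for term_ids in query_log:
        q = np.zeros(num_terms)
        q[term_ids] = 1.0
        C += np.outer(q, q)
        count += 1
    return C / count
```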

Page 32

Future Work

Personalized versions?
Analyze retrieval using stochastic data models?

Computational issues:
Using sampling for efficiency?
Updating using query streams?

Application to other domains?

Page 33

Thanks!
