Latent Semantic Indexing

• Vector Space Retrieval might lead to poor retrieval
– Unrelated documents might be included in the answer set
– Relevant documents that do not contain at least one index term are not retrieved
• Reasoning
– Retrieval based on index terms is vague and noisy
– The user information need is more related to concepts and ideas than to index terms
• Key Idea: map documents and queries into a lower-dimensional space composed of higher-level concepts, which are fewer in number than the index terms
• Dimensionality reduction: retrieval (and clustering) in a reduced concept space might be superior to retrieval in the high-dimensional space of index terms

Despite its success, the vector model suffers from some problems. Unrelated documents may be retrieved simply because query terms occur accidentally in them, and on the other hand related documents may be missed because no term of the query occurs in the document (consider synonyms; one study found that different people use the same keywords to express the same concepts only 20% of the time). It is therefore an interesting idea to see whether retrieval could be based on concepts rather than on terms, by first mapping terms (and queries) to a "concept space" and then establishing the ranking with respect to similarity within the concept space. This idea is explored in the following.


Using Concepts for Retrieval

[Figure: on the left, the terms t1, t2, t3 are connected directly to the documents d1, d2, d3, d4, as in vector space retrieval; on the right, the same terms and documents are connected indirectly through an intermediate layer of concepts c1 and c2.]

This illustrates the approach: rather than directly relating documents and terms as in vector retrieval, there is a middle layer into which both queries and documents are mapped. The space of concepts can be of smaller dimension. For example, we could determine that the query t3 returns d2, d3, d4 in the answer set, based on the observation that they relate to concept c2, without requiring that the documents contain term t3. The question is how to obtain such a concept space. One possible way would be to find canonical representations of natural language, but this is a difficult task to achieve. Much simpler, we can try to use mathematical properties of the term-document matrix, i.e. determine the concepts by matrix computation.


Basic Definitions

• Problem: how to identify and compute “concepts” ?

• Consider the term-document matrix
– Let M = (w_ij) be a term-document matrix with m rows (terms) and n columns (documents)
– Each element w_ij of this matrix is a weight associated with term k_i and document d_j
– The weight w_ij can be based on a tf-idf weighting scheme


Computing the Ranking Using M

[Figure: the query q is an m-dimensional column vector over the terms; the documents doc1, …, doc6 are the m-dimensional columns of the m × n matrix M. The product M^t • q yields the score vector (query•doc1, query•doc2, …, query•doc6).]

This figure illustrates how the term-document matrix M can be used to compute the ranking of the documents with respect to a query q. (The columns of M and the vector q have to be normalized to length 1.)
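As a minimal sketch of this matrix-vector ranking (the 3 × 4 weight matrix and the query vector are hypothetical toy data), the computation can be written directly in NumPy:

```python
import numpy as np

# Hypothetical term-document matrix M: m = 3 terms, n = 4 documents
M = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0]])
q = np.array([1.0, 0.0, 1.0])        # query vector over the same 3 terms

# Normalize the columns of M and the query to length 1, then compute M^t . q
Mn = M / np.linalg.norm(M, axis=0)
qn = q / np.linalg.norm(q)
scores = Mn.T @ qn                   # cosine similarity query . doc_i
print(np.argsort(-scores))           # document indices, best match first
```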


Singular Value Decomposition

• Key Idea: extract the essential features of M^t and approximate it by the most important ones
• Singular Value Decomposition (SVD)
– M = K • S • D^t
– K and D are matrices with orthonormal columns
– S is an r × r diagonal matrix of the singular values, sorted in decreasing order, where r = min(m, n), i.e. the rank of M
– Such a decomposition always exists and is unique (up to sign)
• Construction of the SVD
– K is the matrix of eigenvectors derived from M • M^t
– D^t is the matrix of eigenvectors derived from M^t • M
– Algorithms for constructing the SVD of an m × n matrix have complexity O(n³) if m ≈ n

For extracting "conceptual features", a mathematical construction from linear algebra is used: the singular value decomposition (SVD). It decomposes a matrix into the product of three matrices, where the middle matrix is a diagonal matrix whose elements are the singular values. This decomposition can always be constructed in O(n³). Note that this complexity is considerable, which makes the approach computationally expensive. There exist, however, also approximation techniques to compute the decomposition more efficiently.
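A quick sketch of the decomposition with NumPy; the small random matrix merely stands in for a real term-document matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((5, 4))                       # m = 5 terms, n = 4 documents

K, s, Dt = np.linalg.svd(M, full_matrices=False)
# K: m x r with orthonormal columns, s: singular values in decreasing order,
# Dt: r x n with orthonormal rows, r = min(m, n)
print(np.allclose(M, K @ np.diag(s) @ Dt))   # True: M = K . S . D^t
```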


Illustration of Singular Value Decomposition

[Figure: M = K • S • D^t with dimensions m × n = (m × r) • (r × r) • (r × n), assuming n ≤ m.]


Interpretation of SVD

• First interpretation: we can write

M = Σ_{i=1..r} s_i • k_i • d_i^t

• The s_i are ordered in decreasing size; thus by taking only the largest ones we obtain a good "approximation" of M
• Second interpretation: the singular values s_i are the lengths of the semi-axes of the hyperellipsoid E defined by

E = { M•x | ‖x‖₂ = 1 }

• Each value s_i corresponds to a dimension of a "concept space"
• Third interpretation: the SVD is a least-squares approximation

The singular value decomposition is an extremely useful construction to reveal properties of a matrix. This is illustrated by the following facts:

1. We can write the matrix as a sum of components weighted by the singular values; thus we obtain approximations of the matrix by considering only the larger singular values.

2. The singular values also have a geometrical interpretation, as they tell us how a unit ball (‖x‖ = 1) is distorted by multiplication with the matrix M. Thus we can view the semi-axes of the hyperellipsoid E as the dimensions of the concept space.

3. The SVD after eliminating less important dimensions (smaller singular values) can be interpreted as a least-squares approximation of the original matrix.


Latent Semantic Indexing

• In the matrix S, select only the s largest singular values
– Keep the corresponding columns in K and D^t
• The resulting matrix is called M_s and is given by
– M_s = K_s • S_s • D_s^t
where s, s < r, is the dimensionality of the concept space
• The parameter s should be
– large enough to allow fitting the characteristics of the data
– small enough to filter out the non-relevant representational details

Using the singular value decomposition, we can now derive an "approximation" of M by taking only the s largest singular values in matrix S. The choice of s determines how many of the "important concepts" the ranking will be based on. The assumption is that concepts with small singular values in S are rather to be considered as "noise" and can thus be neglected.
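Continuing that sketch, truncating the decomposition to the s largest singular values (again on a stand-in matrix) might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((5, 4))
K, sv, Dt = np.linalg.svd(M, full_matrices=False)

s = 2                                      # dimensionality of the concept space, s < r
Ks, Ss, Dst = K[:, :s], np.diag(sv[:s]), Dt[:s, :]
Ms = Ks @ Ss @ Dst                         # rank-s approximation of M
print(np.linalg.norm(M - Ms))              # small: Ms keeps the dominant structure
```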


Illustration of Latent Semantic Indexing

[Figure: M_s (m × n) = K_s (m × s) • S_s (s × s) • D_s^t (s × n); the rows of K_s are the term vectors and the columns of D_s^t are the document vectors.]

This illustrates how the sizes of the involved matrices shrink when only the first s singular values are kept for the computation of the ranking. The rows of matrix K_s correspond to term vectors, whereas the columns of matrix D_s^t correspond to document vectors. By using the cosine similarity measure between columns, the similarity of documents can be evaluated.


Answering Queries

• Documents can be compared by computing the cosine similarity in the "document space", i.e. comparing the document vectors d_i and d_j, the columns of matrix D_s^t
• A query q is treated like one further document
– it is added like an additional column to matrix M
– the same transformation is applied as for mapping M to D
• Mapping of M to D
– M = K • S • D^t ⇒ S^{-1} • K^t • M = D^t (since K^t • K = 1) ⇒ D = M^t • K • S^{-1}
• Apply the same transformation to q: q* = q^t • K_s • S_s^{-1}
• Then compare the transformed vectors using the standard cosine measure

sim(q*, d_i) = (q* • (D_s^t)_i) / (|q*| |(D_s^t)_i|)

where (D_s^t)_i denotes the i-th column of matrix D_s^t.

The mapping of query vectors can be mathematically explained as follows: it corresponds to adding a new column (like a new document) to matrix M. We can use the fact that K_s^t • K_s = 1. Since M_s = K_s • S_s • D_s^t, we obtain S_s^{-1} • K_s^t • M_s = D_s^t, or D_s = M_s^t • K_s • S_s^{-1}. Thus adding columns to D_s^t requires exactly the transformation that is applied to obtain q*.
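A minimal end-to-end sketch of the query mapping, assuming a stand-in matrix and a hypothetical query vector:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((5, 4))                       # m = 5 terms, n = 4 documents
K, sv, Dt = np.linalg.svd(M, full_matrices=False)

s = 2
Ks, Ss_inv, Dst = K[:, :s], np.diag(1.0 / sv[:s]), Dt[:s, :]

q = np.array([1.0, 0.0, 1.0, 0.0, 0.0])      # query over the 5 terms
q_star = q @ Ks @ Ss_inv                     # q* = q^t . K_s . S_s^{-1}

# Cosine similarity between q* and each document column of D_s^t
docs = Dst / np.linalg.norm(Dst, axis=0)
scores = docs.T @ (q_star / np.linalg.norm(q_star))
print(np.argsort(-scores))                   # ranked document indices
```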


Example (SVD, s=2)

[Figure: the matrices K_s, S_s, and D_s^t of the SVD, with s = 2, for the term-document matrix of the running example.]


Mapping of Query Vector into Document Space

[Figure: the query vector for the query "application theory" is mapped into the document space by computing q* = q^t • K_s • S_s^{-1}.]


Ranked Result

[Figure: the ranked result lists obtained for s = 2 and for s = 4.]

This is the ranking produced for the query for different values of s.


Plot of Terms and Documents in 2-d Space

Since the concept space has two dimensions, we can plot both the documents and the terms in the two-dimensional space. It is interesting to observe how semantically "close" terms and documents cluster in the same regions. This illustrates very well the power of latent semantic indexing in revealing the "essential" semantics of a document collection.


Discussion of Latent Semantic Indexing

• Latent semantic indexing provides an interesting conceptualization of the IR problem
• Advantages
– It allows reducing the complexity of the underlying representational framework
– For instance, for interfacing with the user
• Disadvantages
– Computationally expensive
– Assumes a normal distribution of terms (least squares), whereas term frequencies are counts


3. User Relevance Feedback

[Figure: the relevance feedback loop: from an information need a query is formulated; feature extraction is applied to the content of the information items; the ranking system produces a ranked/binary result, which the user browses; by identifying relevant results, relevance feedback produces a system-modified query (e.g. by query term reweighting), and the cycle repeats.]

The user does not necessarily know
• what his information need is
• how to appropriately formulate the query
• BUT he can identify relevant documents

Therefore the idea of user relevance feedback is to reformulate the query taking into account the user's feedback on the relevance of the retrieved documents. The advantages of such an approach are the following:
• The user is not involved in query formulation; he just points to interesting data items.
• The search task can be split up into smaller steps.
• The search task becomes a process converging to the desired result.


Identifying Relevant Documents

• If C_r is the set of relevant documents, then the optimal query vector is

q_opt = 1/|C_r| • Σ_{d_j ∈ C_r} d_j − 1/(N − |C_r|) • Σ_{d_j ∉ C_r} d_j

(N is the total number of documents)

• Idea: approximate this vector by letting the user identify relevant and non-relevant documents

[Figure: within some retrieval result R, the user identifies D_r, the documents judged relevant, and D_n, the documents judged non-relevant; the true relevant set C_r is not known.]

If we knew the exact set of relevant documents, it can be proven that the optimal query vector could be expressed in terms of the document vectors as shown above. Intuitively this expression can be well understood: it gives positive weights to the document vectors in the set of relevant documents and negative weights to the others, normalized by the sizes of the two sets. The problem for practical retrieval is of course that we do not know C_r. However, with user relevance feedback we have the possibility to approximate this optimal query vector.


Calculation of Expanded Query

• If the user identifies some relevant documents D_r from the result set R of a retrieval query q
– Assume all elements in R \ D_r are not relevant, i.e. D_n = R \ D_r
– Modify the query to approximate the theoretically optimal query (Rocchio):

q_approx = α • q + β/|D_r| • Σ_{d_j ∈ D_r} d_j − γ/|R \ D_r| • Σ_{d_j ∈ R \ D_r} d_j

α, β, γ are tuning parameters

The best approximation of C_r that we can obtain is to take the set of documents D_r that the user indicated as relevant as the set of relevant documents. This set is used to modify the original query vector in a way that approximates the optimal query vector. The scheme for doing this shown here is called the Rocchio scheme. Several variations of this scheme exist, which all follow the same principle. The tuning parameters are used to set the relative importance of
• keeping the original query vector,
• increasing the weights of vectors from D_r,
• decreasing the weights of vectors from the complement of D_r.
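A small sketch of the Rocchio update over NumPy document vectors; the query and document vectors are hypothetical toy data, and alpha, beta, gamma are the tuning parameters of the slide:

```python
import numpy as np

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Rocchio: q' = a.q + b/|Dr| * sum(Dr) - g/|Dn| * sum(Dn)."""
    q_new = alpha * q
    if relevant:
        q_new += beta / len(relevant) * np.sum(relevant, axis=0)
    if nonrelevant:
        q_new -= gamma / len(nonrelevant) * np.sum(nonrelevant, axis=0)
    return q_new

q = np.array([1.0, 1.0, 0.0, 0.0])                 # toy query vector
Dr = [np.array([1.0, 1.0, 1.0, 0.0])]              # documents judged relevant
Dn = [np.array([0.0, 1.0, 0.0, 1.0]),
      np.array([0.0, 1.0, 0.0, 0.0])]              # documents judged non-relevant
print(rocchio(q, Dr, Dn))
```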


Example

• Query q= "application theory"• Result

• Query reformulation

• Result for reformulated query

0.77: B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory0.68: B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application0.23: B11 Oscillation Theory for Neutral Differential Equations with Delay0.23: B12 Oscillation Theory of Delay Differential Equations

3 17 12 111 1 1 1( ),4 4 12 4approxq q d d d d α β γ= + − + + = = =

r r uur uur uur uur

0.87: B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application0.61: B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory0.29: B7 Knapsack Problems: Algorithms and Computer Implementations0.23: B5 Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry

and Commutative Algebra

\j jr r

j japproxd D d Dr r

q q d dD R Dβ γα

∈ ∉

= + −∑ ∑ur ur

r r ur ur

This example shows how query reformulation works. By identifying document B3 as relevant and modifying the query vector accordingly, it turns out that new documents (B5 and B7) become relevant. The reason is that these new documents share terms with document B3, and these terms are newly considered in the reformulated query.


Summary

• Which capability of users is taken advantage of with relevance feedback ?

• Which query vector is approximated when adapting a user query with relevance feedback ?

• Can documents which do not contain any keywords of the original query receive a positive similarity coefficient after relevance feedback ?


4. Inverted Files

• Problem: text retrieval algorithms need to find words in documents efficiently
– Boolean retrieval, vector space retrieval
– Given index term k_i, find document d_j
• An inverted file is a word-oriented mechanism for indexing a text collection in order to speed up this search task
– Addressing of documents and word positions within documents
– Most frequently used indexing technique for large text databases
– Appropriate when the text collection is large and semi-static

B1 A Course on Integral Equations
B2 Attractors for Semigroups and Evolution Equations
B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application
B4 Geometrical Aspects of Partial Differential Equations
B5 Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra
B6 Introduction to Hamiltonian Dynamical Systems and the N-Body Problem
B7 Knapsack Problems: Algorithms and Computer Implementations
B8 Methods of Solving Singular Systems of Ordinary Differential Equations
B9 Nonlinear Systems
B10 Ordinary Differential Equations
B11 Oscillation Theory for Neutral Differential Equations with Delay
B12 Oscillation Theory of Delay Differential Equations
B13 Pseudodifferential Operators and Nonlinear Partial Differential Equations
B14 Sinc Methods for Quadrature and Differential Equations
B15 Stability of Stochastic Differential Equations with Respect to Semi-Martingales
B16 The Boundary Integral Approach to Static and Dynamic Contact Problems
B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory

Example: application → B3, B17

In order to implement text retrieval models efficiently, efficient search for word occurrences in documents must be supported. For this purpose different indexing techniques exist, among which inverted files are by far the most widely used. Inverted files are optimized for supporting search on relatively static text collections; for example, no database updates are supported with inverted files. This distinguishes inverted files from typical database indexing techniques, such as B+-trees.


Inverted Files

Inverted list l_k for a term k:

l_k = ⟨ f_k : d_{i1}, …, d_{i f_k} ⟩

f_k = the number of documents in which k occurs
d_{i1}, …, d_{i f_k} = the list of identifiers of the documents containing k

Inverted file: lexicographically ordered sequence of inverted lists

IF = ⟨ i, k_i, l_{k_i} ⟩, i = 1, …, m

Inverted files are constructed by concatenating the inverted lists of all keywords occurring in the document collection. An inverted list enumerates all occurrences of its keyword in documents, keeping the document identifiers and the frequency of occurrence (the latter is useful, for example, for determining the inverse document frequency).
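As a sketch, an inverted file over a small document collection can be emulated with a dictionary of lists; iterating over the sorted keys corresponds to the lexicographically ordered sequence of inverted lists (the shortened sample titles are taken from the running example):

```python
from collections import defaultdict

docs = {1: "a course on integral equations",
        3: "automatic differentiation of algorithms",
        5: "ideals varieties and algorithms"}

index = defaultdict(list)                    # term -> list of document ids
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

for term in sorted(index):                   # the inverted file
    print(term, len(index[term]), ":", index[term])   # f_k : d_i1 ... d_ifk
```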


Example

1 Algorithms 3 : 3 5 7
2 Application 2 : 3 17
3 Delay 2 : 11 12
4 Differential 8 : 4 8 10 11 12 13 14 15
5 Equations 10 : 1 2 4 8 10 11 12 13 14 15
6 Implementation 2 : 3 7
7 Integral 2 : 16 17
8 Introduction 2 : 5 6
9 Methods 2 : 8 14
10 Nonlinear 2 : 9 13
11 Ordinary 2 : 8 10
12 Oscillation 2 : 11 12
13 Partial 2 : 4 13
14 Problem 2 : 6 7
15 Systems 3 : 6 8 9
16 Theory 4 : 3 11 12 17

Here we display the inverted file that is obtained for our example document collection.


Physical Organization of Inverted Files

[Figure: physical organization of an inverted file.
– Document file (secondary storage): the documents D1, …, Dn stored in a contiguous file; space requirement O(n).
– Posting file (secondary storage): the occurrences of words, stored ordered lexicographically; space requirement O(n).
– Index file (main memory): one entry (key, #docs, pos) for each term of the vocabulary; space requirement O(n^β), 0.4 < β < 0.6 (Heaps' law).
– Access structure to the vocabulary (main memory): can be a B+-tree, hashing, or a sorted array; space requirement O(n^β).]

Inverted files as introduced before are a logical data structure for which we have to find a physical storage organization. This organization has to take the quantitative characteristics of the inverted file structure into account. The important observation is that the number of references to documents, corresponding to the occurrences of index terms in the documents, is much larger than the number of index terms, and thus the number of inverted lists. In fact, the number of occurrences of index terms is O(n), whereas the number of index terms is typically O(n^β), where β is roughly 0.5. For example, a document collection of size n = 10^6 would have approximately m = 10^3 index terms. Therefore the index terms and the corresponding occurrence frequencies can be kept in main memory, whereas the references to documents are kept in secondary storage. As a result, an index file is kept in memory; access to this index file is supported by any suitable data access structure, typically binary search, hash tables, or tree-based structures such as B+-trees or tries. The posting file consists of the sequence of all occurrences of the inverted file. The index file is related to the posting file by keeping, for each index term, a reference to the position in the posting file where the entries for that index term start. The occurrences stored in the posting file in turn refer to entries in the document file, which is also kept in secondary storage.


Example

[Figure: physical layout for the running example. The index file entries

Algorithms 3 → 3 5 7
Application 2 → 3 17
Delay 2 → 11 12
Differential 8 → 4 8 10 11 12 13 14 15

point into the posting file, whose entries refer to the document file with B1 A Course on Integral Equations, B2 Attractors for Semigroups and Evolution Equations, …, B11 Oscillation Theory for Neutral Differential Equations with Delay.]

This would be the physical organization of the inverted file for the running example. Note that only part of the data is displayed.


Searching the Inverted File

Step 1: Vocabulary search
– the words present in the query are searched in the index file
Step 2: Retrieval of occurrences
– the lists of the occurrences of all words found are retrieved from the posting file
Step 3: Manipulation of occurrences
– the occurrences are processed in the document file to process the query

[Figure: the index entries (Algorithms, Application, Delay, Differential, …) point into the posting file, whose entries in turn point to the documents B1, B2, … of the running example.]

Search in an inverted file is a straightforward procedure. Using the data access structure, the index terms occurring in the query are first looked up in the index file. Then their occurrences can be sequentially retrieved from the posting file. Afterwards the corresponding document portions are accessed and can be processed (e.g. for counting frequencies).


Construction of the Inverted File

Step 1: Search phase
– The vocabulary is kept in a trie data structure storing for each word a list of its occurrences
– Each word of the text is read sequentially and searched in the vocabulary
– If it is not found, it is added to the vocabulary with an empty list of occurrences
– The word position is added to the end of its list of occurrences
Step 2: Storage phase (once the text is exhausted)
– The lists of occurrences are written contiguously to the disk (posting file)
– The vocabulary is stored in lexicographical order (index file) in main memory, together with a pointer for each word to its list in the posting file

Overall cost O(n)

The index construction is performed by first dynamically constructing a trie structure, in order to build up a sorted vocabulary and to collect the occurrences of the index terms. After the complete document collection has been traversed, the trie is traversed sequentially and the posting file is written to secondary storage. The trie itself can serve as the data access structure for the index file that is kept in main memory.
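A compact sketch of the two phases, with a plain dictionary standing in for the trie and crude position bookkeeping (the exact positions therefore differ slightly from the slides):

```python
def build_index(text):
    # Search phase: collect, for every word, the list of its positions
    vocab = {}                                     # dict as a stand-in for the trie
    pos = 1
    for word in text.split():
        vocab.setdefault(word, []).append(pos)     # new word, or extend its list
        pos += len(word) + 1
    # Storage phase: sorted vocabulary, occurrences concatenated into one posting file
    postings, index = [], {}
    for word in sorted(vocab):
        index[word] = (len(vocab[word]), len(postings))   # (frequency, pointer)
        postings.extend(vocab[word])
    return index, postings

text = "the house has a garden the garden has many flowers the flowers are beautiful"
index, postings = build_index(text)
print(index["garden"])          # (2, pointer into the posting sequence)
print(postings)
```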


Example

1 6 12 16 18 25 29 36 40 45 54 58 66 70
the house has a garden. the garden has many flowers. the flowers are beautiful

(each word = one document, position = document identifier)

[Figure: successive snapshots of the trie during the first insertions: first "the: 1"; then "house: 6", branching at 'h'/'t'; then "has: 12", branching at 'a'/'o' below 'h'; then "a: 16" and "garden: 18"; finally the second occurrence of "the" merely appends 25 to the existing entry, giving "the: 1, 25", without changing the tree.]

In this example we consider each word of the text as a separate document, identified by its position (to keep the example small). We demonstrate the initial steps of constructing the trie structure and entering the occurrences of the index terms into it. The changes to the trie structure are highlighted at each step. Note that in the last step the tree structure of the trie does not change, since the index term "the" is already present.


Example

1 6 12 16 18 25 29 36 40 45 54 58 66 70
the house has a garden. the garden has many flowers. the flowers are beautiful

[Figure: the complete trie, with branches 'a'/'are', 'b' (beautiful), 'f' (flowers), 'g' (garden), 'h' → 'a'/'o' (has/house), 'm' (many), and 't' (the).]

inverted file I:
a: 16
are: 66
beautiful: 70
flowers: 45, 58
garden: 18, 29
has: 12, 36
house: 6
many: 40
the: 1, 25, 54

postings file: 16, 66, 70, 45, 58, 18, 29, 12, 36, 6, 40, 1, 25, 54

Once the complete trie structure is constructed, the inverted file can be derived from it. To do so, the trie is traversed top-down and left-to-right; whenever an index term is encountered, it is appended to the end of the inverted file. Note that if a term is a prefix of another term (as "a" is a prefix of "are"), index terms can also occur at internal nodes of the trie. Analogously to the construction of the inverted file, the posting file can be derived.


Example

1 6 12 16 18 25 29 36 40 45 54 58 66 70
the house has a garden. the garden has many flowers. the flowers are beautiful

[Figure: the same trie as access structure in main memory; each index entry now stores a pointer into the postings file instead of the occurrence list: a: 1, are: 2, beautiful: 3, flowers: 4, garden: 6, has: 8, house: 10, many: 11, the: 12.]

postings file: 16, 66, 70, 45, 58, 18, 29, 12, 36, 6, 40, 1, 25, 54

Considering the physical organization of the inverted file, the result can be displayed as shown. The constructed trie is a possible access structure to the index file in main memory; the entries of the index file occur as leaves (or internal nodes) of the trie. Each entry holds a reference to the position in the postings file, held in secondary storage, where its list of occurrences starts.


Index Construction in Practice

• In practice not all index information can be kept in main memory
• Index merging
– When no more memory is available, a partial index I_i is written to disk
– The main memory is erased before continuing with the rest of the text
– Once the text is exhausted, a number of partial indices I_i exist on disk
– The partial indices are merged to obtain the final index

[Figure: the initial indices I1, …, I8 are pairwise merged into I1…2, I3…4, I5…6, I7…8 (level 1), these into I1…4 and I5…8 (level 2), and finally into the final index I1…8 (level 3).]

In practice the construction becomes inefficient or impossible if the size of the intermediate trie structure exceeds the available main memory. The index construction process then has to be partitioned in the following way: while the document collection is sequentially traversed, partial indices are written to disk whenever main memory is full. This results in a number of partial indices, indexing consecutive partitions of the text. In a second phase the partial indices are merged into one index.

The figure illustrates the merging process: in this example 8 partial indices have been constructed. Step by step, two indices at a time are merged into one, until one final index remains. The merging can be performed such that the two partial indices to be merged are scanned sequentially in parallel on disk, while the resulting index is written sequentially to disk.


Example

1 6 12 16 18 25 29 36 40 45 54 58 66 70
the house has a garden. the garden has many flowers. the flowers are beautiful

inverted file I1:
a: 16
garden: 18, 29
has: 12, 36
house: 6
the: 1, 25

inverted file I2:
are: 66
beautiful: 70
flowers: 45, 58
many: 40
the: 54

merged: for "the", the inverted lists are concatenated: 1, 25 + 54 → 1, 25, 54

total cost: O(n log₂(n/M)), where M is the size of the main memory

Merging the indices requires first merging the vocabularies. As we know, the vocabularies are comparably small, so the merging of the vocabularies can take place in main memory. If a vocabulary term occurs in both partial indices, their lists of occurrences from the posting file need to be combined. Here we can take advantage of the fact that the partial indices have been constructed by sequentially traversing the document file: the lists can be directly concatenated without sorting. The total computational complexity of the merging algorithm is O(n log₂(n/M)). This implies that the additional cost of merging, as compared to a purely main-memory-based construction of inverted files, is a factor of O(log₂(n/M)). This is small in practice; e.g. if the database size n is 64 times the main memory size M, this factor is 6. The example illustrates how the merging process can be performed when the database is partitioned into two parts.
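A sketch of the merge step for two partial indices represented as term → occurrence-list dictionaries; since the first partition precedes the second in the document file, concatenating the lists suffices:

```python
def merge_indices(i1, i2):
    """Merge two partial inverted files; occurrence lists are concatenated."""
    merged = {}
    for term in sorted(set(i1) | set(i2)):           # merge the vocabularies
        merged[term] = i1.get(term, []) + i2.get(term, [])   # no sorting needed
    return merged

I1 = {"a": [16], "garden": [18, 29], "has": [12, 36], "house": [6], "the": [1, 25]}
I2 = {"are": [66], "beautiful": [70], "flowers": [45, 58], "many": [40], "the": [54]}
print(merge_indices(I1, I2)["the"])   # [1, 25, 54]
```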


Addressing Granularity

• Documents can be addressed at coarser and finer granularities
– coarser: text blocks spanning multiple documents
– finer: paragraph, sentence, word level
• General rule
– the finer the granularity, the less post-processing but the larger the index
• Example: index size in % of the document collection size

Index                  | Small collection (1Mb) | Medium collection (200Mb) | Large collection (2Gb)
Addressing words       | 73%                    | 64%                       | 63%
Addressing documents   | 26%                    | 32%                       | 47%
Addressing 256K blocks | 25%                    | 2.4%                      | 0.7%

The posting file has by far the largest space requirements. An important factor for the size of an inverted file is the addressing granularity used, which determines how exactly the positions of index terms are recorded in the posting file. There are three main options:
• exact word position,
• occurrence within a document,
• occurrence within an arbitrarily sized block, i.e. equally sized partitions of the document file that may span multiple documents.
The coarser the granularity, the fewer entries occur in the posting file. In turn, coarser granularity requires additional post-processing to determine the exact positions of index terms. Experiments illustrate the substantial gains that can be obtained with coarser addressing granularities. Coarser granularities reduce the index size for two reasons:
• a reduction in pointer size (e.g. from 4 bytes for word addressing to 1 byte with block addressing),
• a lower number of occurrences.
Note that in the example, for a 2Gb document collection with 256K-block addressing, the index size is reduced by a factor of almost 100.


Index Compression

• Documents are ordered and each document identifier d_ij is replaced by the difference to the preceding document identifier
– Document identifiers are encoded using fewer bits for smaller, common numbers
• Use of varying-length compression further reduces the space requirement
• In practice the index is reduced to 10-15% of the database size

l_k = ⟨ f_k : d_{i1}, d_{i2}, …, d_{i f_k} ⟩ → l'_k = ⟨ f_k : d_{i1}, d_{i2} − d_{i1}, …, d_{i f_k} − d_{i f_k − 1} ⟩

X  | code(X)
1  | 0
2  | 10 0
3  | 10 1
4  | 110 00
5  | 110 01
6  | 110 10
7  | 110 11
8  | 1110 000
63 | 111110 11111

A further reduction of the index size can be achieved by applying compression techniques to the inverted lists. In practice the inverted list of a single term can be rather large. A first improvement is to store only the differences between subsequent document identifiers: since the identifiers occur in ascending order, the differences are much smaller integers than the absolute values. In addition, number encoding techniques can be applied to the resulting integer values. Since small values are more frequent than large ones, this leads to a further reduction in the size of the posting file.
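A sketch of both steps: gap encoding of a posting list, and the prefix code from the table above (n leading 1-bits, a 0, then the n-bit remainder; a variant of Elias gamma coding):

```python
def gaps(doc_ids):
    """Replace sorted document ids by differences to the predecessor."""
    return [d - p for d, p in zip(doc_ids, [0] + doc_ids[:-1])]

def encode(x):
    """Code from the table: n ones, a zero, then x - 2^n in n bits (x >= 1)."""
    n = x.bit_length() - 1
    if n == 0:
        return "0"
    return "1" * n + "0" + format(x - (1 << n), "b").zfill(n)

ids = [4, 8, 10, 11, 12, 13, 14, 15]          # postings of "Differential"
print(gaps(ids))                               # [4, 4, 2, 1, 1, 1, 1, 1]
print([encode(g) for g in gaps(ids)])          # short codes for the small gaps
```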


Summary

• Which aspect of information retrieval is addressed by inverted files ?

• How do inverted files compare to other database indexing approaches ?

• How is an inverted list organized and which are methods to compress it ?

• Which is the file organization of an inverted file ?

• Why are different addressing granularities used in inverted files ?

• Which problem occurs in the construction of an inverted file and how is it addressed ?


5. Web Information Retrieval

• Full-text retrieval returns results with equal ranking; which page is more relevant?
– relevance related to the number of referrals (incoming links)
– relevance related to the number of referrals with high relevance

[Figure: two pages "DVD player 1" and "DVD player 2" that text retrieval ranks equally, embedded in a link graph with pages such as Sony US, Sony EU, Sony CH, Sony D, Excite, Yahoo, Alta Vista, and a DVD expert page, illustrating referrals that differ in number and in relevance.]

When retrieving documents from the Web, the link structure bears important information on the relevance of documents. Generally speaking, a document that is referred to more often by Web links from other documents is likely to be of higher interest and therefore higher relevance. So one possibility to rank documents is to consider the number of incoming links. This allows distinguishing documents that would otherwise be ranked equally or similarly when relying on text retrieval alone. However, the importance of the documents linking to the document to be ranked may itself differ. Therefore it appears appropriate not only to count the number of incoming links, but also to weight each link by the relevance of the document from which it emanates. The same reasoning then applies again, recursively, for evaluating the relevance of the documents pointing to the document, and so forth.


Random Walker Model

• Assumption: if a random walker visits a page more often, it is more relevant; this takes into account the number of referrals AND the relevance of the referrals

P(p_i) = Σ_{p_j | p_j → p_i} P(p_j) / C(p_j)

N = the number of Web pages
C(p) = the number of outgoing links of page p
P(p_i) = the probability to visit page p_i (= relevance), where the sum ranges over the pages p_j pointing to p_i

[Figure: a page p_i is pointed to by pages among p_1, …, p_N; each pointing page p_j distributes its visiting probability evenly over its C(p_j) outgoing links.]

In order to capture this process of evaluating the relevance of documents by considering incoming links, where the relevance of the pointing documents is assumed to be known (a recursive procedure), the random walker model is used. The intuitive idea is that if a random walker on the Web graph visits a document more often, the document is more relevant. This intuition is captured mathematically in a recursive equation characterizing the probability that a random walker visits a specific page. The assumption used in this mathematical formulation is that the walker follows every link of a document with equal probability.


Transition Matrix for Random Walker

• The definition of P(p_i) can be reformulated as a matrix equation
• The vector of relevance of pages is an eigenvector of the matrix R

R_ij = 1/C(p_j) if p_j → p_i, 0 otherwise

p = (P(p_1), …, P(p_N))

p = R • p, Σ_{i=1..N} p_i = 1

If we consider the matrix of transition probabilities of the random walker, the recursive definition of the probability to visit a Web page can be formulated as a matrix equation. More precisely, the resulting equation shows that the vector of visiting probabilities is an eigenvector of the transition matrix; in fact, it is the one corresponding to the largest eigenvalue.


Example

[Figure: three pages with links p1 → p2, p1 → p3, p2 → p3, and p3 → p1; C(p1) = 2, C(p2) = 1, C(p3) = 1.]

Link matrix (entry (i, j) = 1 if p_j links to p_i; row i collects the links to p_i, column j the links from p_j):

( 0 0 1 )
( 1 0 0 )
( 1 1 0 )

R =
( 0   0 1 )
( 1/2 0 0 )
( 1/2 1 0 )

p = (2/5, 1/5, 2/5), i.e. P(p1) = 2/5, P(p2) = 1/5, P(p3) = 2/5

Ranking: p1, p3 > p2

This example illustrates the computation of the probabilities of visiting specific Web pages. The values C(p_i) determine the transition probabilities. R can be derived from the link matrix, which has an entry 1 at (i, j) if there is a link from p_j to p_i, by dividing the values in each column by the sum of the values of that column. The probability of the random walker being at a node is then obtained from the eigenvector of this matrix.
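The eigenvector can be checked numerically; a small sketch with NumPy:

```python
import numpy as np

# Transition matrix R for the example: columns divided by out-degree C(p_j)
R = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])

w, v = np.linalg.eig(R)
p = np.real(v[:, np.argmax(np.real(w))])   # eigenvector of the largest eigenvalue
p /= p.sum()                               # normalize so the entries sum to 1
print(p)                                   # [0.4, 0.2, 0.4] -> p1, p3 > p2
```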


Modified Example

[Figure: the link p3 → p1 is removed, leaving p1 → p2, p1 → p3, p2 → p3; C(p1) = 2, C(p2) = 1.]

Link matrix:

( 0 0 0 )
( 1 0 0 )
( 1 1 0 )

R =
( 0   0 0 )
( 1/2 0 0 )
( 1/2 1 0 )

p = (0, 0, 0), i.e. P(p1) = P(p2) = P(p3) = 0

No Ranking

The approach described so far has a problem, however. Looking at the modified example, we see that node p3 is a "sink of rank": it has no outgoing links. Any random walk ends up in this sink, so the other nodes do not receive any ranking weight, and consequently the sink receives none either. Therefore the only solution of the equation p = R • p is the zero vector.


Source of Rank

• Assumption: random walker jumps with probability 1-q to an arbitrary node: thus nodes without incoming links are reached

• PageRank method

P(p_i) = c • ( q • Σ_{p_j | p_j → p_i} P(p_j)/C(p_j) + (1 − q)/N ), c ≤ 1

As matrix equations:

p = c • (qR + (1 − q)E) • p, where E = [1/N]_{N×N}
p = c • (qR • p + ((1 − q)/N) • e), where e = (1, …, 1)

[Figure: from page p_j the random walker follows one of the C(p_j) outgoing links with total probability q, and jumps to an arbitrary one of the N nodes with probability 1 − q.]

To avoid the previously described problem, we add a "source of rank". The idea is that in each step the random walker may, rather than following a link, jump to an arbitrary page with probability 1 − q. Thus pages without incoming links can also be reached by the random walk. In the mathematical formulation this adds a term for the source of rank: since at each step the walker jumps with probability 1 − q and any of the N pages is reached with the same probability, the additional term is (1 − q)/N. Reformulated as a matrix equation, this means adding an N×N matrix E in which all entries are 1/N; this is equivalent to saying that transitions between any pair of nodes (including a node and itself) are performed with probability 1/N. Since the vector p has norm 1, i.e. the sum of its components is exactly 1, E • p = (1/N) • e, where e is the unit vector, and the matrix equation can be reformulated in the second form shown. The method described is called PageRank and is used by Google.


Modified Example

With q = 0.9 and C(p1) = 2, C(p2) = 1 as before:

R =
( 0   0 0 )
( 1/2 0 0 )
( 1/2 1 0 )

E =
( 1/3 1/3 1/3 )
( 1/3 1/3 1/3 )
( 1/3 1/3 1/3 )

qR + (1 − q)E =
( 1/30  1/30  1/30 )
( 29/60 1/30  1/30 )
( 29/60 14/15 1/30 )

p = (0.123, 0.275, 0.953), i.e. P(p1) = 0.123, P(p2) = 0.275, P(p3) = 0.953

Ranking: p3 > p2 > p1

With the modification of rank computation using a source of rank, we obtain for our example again a non-trivial ranking which appears to be appropriate.


Practical Computation of PageRank

• Iterative computation

p^0 ← s
while δ > ε:
    p^{i+1} ← q • R • p^i
    p^{i+1} ← p^{i+1} + ((1 − q)/N) • e
    δ ← ‖p^{i+1} − p^i‖

s: an arbitrary start vector, e.g. s = e
ε: the termination criterion

For the practical determination of the PageRank ranking, an iterative computation is used. It is derived from the second form of the formulation of the visiting probabilities of the random walker given above. The vector e used to add a source of rank does not necessarily have to assign uniform weights to all pages; it may itself reflect a ranking of Web pages.
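A sketch of the computation in NumPy; an explicit renormalization step stands in for the factor c of the eigenvalue formulation, and the values reproduce the modified example above:

```python
import numpy as np

def pagerank(R, q=0.9, eps=1e-10):
    N = R.shape[0]
    M = q * R + (1 - q) / N                 # qR + (1-q)E with E_ij = 1/N
    p = np.ones(N) / N                      # arbitrary start vector s
    while True:
        p_next = M @ p
        p_next /= np.linalg.norm(p_next)    # renormalize (the factor c)
        if np.abs(p_next - p).sum() < eps:  # termination criterion epsilon
            return p_next
        p = p_next

R = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
print(pagerank(R))    # ~ [0.123, 0.275, 0.953] -> ranking p3 > p2 > p1
```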


Example: ETHZ Page Rank

These are the top documents from the PageRank ranking of all Web pages at ETHZ.


Practical Usage of PageRank

• PageRank is part of the ranking method used by Google
– Compute the global PageRank for all Web pages
– Given a keyword-based query, retrieve a ranked set of documents using standard text retrieval methods
– Merge this ranking with the PageRank result to achieve both high precision (text retrieval) and high quality (PageRank)
– Google also uses other methods to improve the ranking (trade secret)
• Main problems
– Crawling the Web
– Efficient computation of PageRank for large link databases
– Combination with other (text-based) ranking methods
• Some (old) numbers (1998)
– 24 million Web pages, 75 million links, 300 MB link database
– Convergence in roughly 50 iterations on one workstation
– One iteration takes 6 minutes

PageRank is used as one criterion for ranking result documents in Google. Essentially, Google uses text retrieval methods to retrieve relevant documents and then applies PageRank to create a more appropriate ranking. Google also uses other methods to improve the ranking, e.g. by giving different weights to different parts of Web documents; for example, title elements are given higher weight. The details of the ranking methods are trade secrets of the Web search engine providers. Other issues for Web search engines are crawling the Web, which requires techniques that explore the Web without revisiting pages too frequently, and the enormous size of the document and link databases, which poses implementation challenges in keeping the ranking computations scalable.


Hub-Authority Ranking

• Postprocessing of the query result by analyzing the link structure
– Take the result pages and extend the set by all pages that point to or are pointed to by result pages
– The data can be obtained from existing search engines
• Key idea: identify not only authoritative pages (as with PageRank) but also hubs
– Hubs are pages that point to many/relevant authorities
– Authorities are pages that are pointed to by many/relevant hubs

Hub-authority ranking identifies not only pages that have high authority, measured by the number of incoming links, but also pages that have substantial "referential" value, having many outgoing links to pages of high importance. In contrast to the PageRank algorithm, this technique has been proposed to post-process query results, rather than to rank the pages of the complete Web graph. It is used as an add-on to existing search engines.


Definition of Hubs and Authorities

• Direct application of the definition:

H(p_i) = Σ_{p_j ∈ N | p_i → p_j} A(p_j)
A(p_i) = Σ_{p_j ∈ N | p_j → p_i} H(p_j)

• The values should be normalized:

Σ_{p_j ∈ N} A(p_j)² = 1, Σ_{p_j ∈ N} H(p_j)² = 1

Note the inversion in the direction of the link in the indices of the two summations (hubs point to authorities, and authorities are pointed to by hubs).


Iterative Computation

(x^0, y^0) ← (1/N) • ((1, …, 1), (1, …, 1))
l ← 0
while l < k:
    l ← l + 1
    x^l_i ← Σ_{p_j ∈ N | p_j → p_i} y^{l−1}_j   (for all p_i)
    y^l_i ← Σ_{p_j ∈ N | p_i → p_j} x^l_j   (for all p_i)
    x^l ← x^l / ‖x^l‖₂, y^l ← y^l / ‖y^l‖₂
return (x^k, y^k)

x ~ authorities
y ~ hubs

As for PageRank, the hub/authority values can be computed iteratively; x corresponds to the authority weights and y to the hub weights. For this computation too, there is a simple explanation in terms of eigenvectors: if L is the link matrix, then the computation of x from y is defined as x = L^t • y and the computation of y from x is defined as y = L • x. Therefore x* is the principal eigenvector of the matrix L^t • L and y* is the principal eigenvector of the matrix L • L^t.
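A sketch of the iteration over a small hypothetical link matrix L (L[i, j] = 1 if page i links to page j), matching the eigenvector explanation above:

```python
import numpy as np

def hits(L, k=50):
    n = L.shape[0]
    x = np.ones(n) / n              # authority weights
    y = np.ones(n) / n              # hub weights
    for _ in range(k):
        x = L.T @ y                 # x = L^t . y: sum hub weights of in-links
        y = L @ x                   # y = L . x: sum authority weights of out-links
        x /= np.linalg.norm(x)      # normalize: sum of squares = 1
        y /= np.linalg.norm(y)
    return x, y

# Pages 0 and 1 are hubs pointing to the authorities 2 and 3
L = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])
x, y = hits(L)
print(x.round(2), y.round(2))       # authorities: pages 2, 3; hubs: pages 0, 1
```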


Computing the Base Set

• Base set = the set of pages used to apply the H/A algorithm
– Should be small
– Should contain many relevant pages
– Should contain the strongest authorities
• Select the t highest ranked pages (e.g. t = 200): root set R_σ
• Extend the root set by the pages that point to or are pointed to by the root set

When applying hub-authority ranking to improve the ranking of query results, the selection of the base set used for the H/A computation is central. Using the complete result of a text-based query is not feasible, as it may contain millions of documents; on the other hand, the result may not contain important authority pages. Selecting only the top-ranked pages reduces the number of pages and will probably contain a few authorities. In order to capture more authorities, the pages pointing to and pointed to by the query result are included as well, as it is very likely that the query result contains pointers to authorities.


Base Set Algorithm

The set of pages pointing to a result page needs to be restricted, as this set could be very large.


Sample Results Obtained

These are two example results that have been obtained by applying this method. It is interesting to compare them to the results obtained by using a search engine alone: in particular, Gamelan will not show up in that result, whereas the Java Sun page is usually among the top ranked.


Summary

• Which additional source of information can be used to rank Web pages ?

• What is the informal idea underlying PageRank ?

• Why is a source of rank used in PageRank and what can it be used for ?

• How is PageRank practically computed ?


References

• Course material based on
– Ricardo Baeza-Yates, Berthier Ribeiro-Neto: Modern Information Retrieval (ACM Press Series), Addison Wesley, 1999.
• Relevant articles
– Sergey Brin, Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems 30(1-7): 107-117, April 1998.
– Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment. JACM 46(5): 604-632, 1999.