Indexing - l3s.deanand/tir15/lectures/ws15-tir-indexing.pdfTemporal Information Retrieval Index Organization, Construction, Temporal Query ... Each term descriptor contains the term

Indexing

Temporal Information Retrieval

Index Organization, Construction, Temporal Query Processing

Avishek Anand

✤ Inverted Indexing basics revisited

✤ Indexing Static Collections

✤ Dictionaries

✤ Forward Index

✤ Inverted Index Organisation

✤ Scalable Indexing

✤ Indexing Temporal Collections

Inverted Index

2

Avishek Anand

✤ Why do we index text collections ?

✤ How do we index documents ?

✤ What are the data structures ?

✤ What are the design decisions for organising the index?

✤ How do we index huge collections ?

✤ How do we index temporal collections ?

Text Collections and Indexing

3

Avishek Anand

✤ Why do we index text collections ?

✤ How do we index documents ?

✤ What are the data structures ?

✤ What are the design decisions for organising the index?

✤ How do we index huge collections ?

✤ How do we index temporal collections ?

Text Collections and Indexing

3

Efficient document retrieval

lexicon, inverted lists

document order, score order

distributed indexing, term/doc partitioning

index maintenance strategies

Avishek Anand

Terminology Recap

4

information retrieval

Lexicon

queries, results

terms, documents, collection

index, lexicon, posting, posting list

stemming, stop-word rem

(d1 , 2, <2, 15>)

(d4 , 3, <2, 15, 23, >)

(d34 , 3, <2, 15, 23>)

…….

(d1 , 2, <3, 16>)

(d4 , 2, <23, 16>)

(d31 , 4, <8, 19, 30>)

…….

Avishek Anand

✤ Maintains statistics and information about the indexed unit (word, n-gram etc)

✤ Posting list location - for posting list retrieval

✤ Term identifier - for term lookups, matching and range queries

✤ document frequency and associated statistics - for ranking

✤ Data Structures for Lexicon

✤ Hash-based Lexicon

✤ B+-Tree based Lexicon

Lexicon or Dictionary

5

< hannover ; location: 82271; tid:12 ; df:23, … >

Avishek Anand

Hash-Based Lexicon

6

Information Retrieval: Implementing and Evaluating Search Engines · c⃝MIT Press, 2010 · DRAFT108

!"#$%!&%#'%(())*+,*-

'%./%0123%4%!!(,(5*6)6,

##'73#'$5,+8*)*-

9%1":;<=%4>:20)),5,?+,+

&'%0:9=@!12:4)*-,,5+6*

/#47!0#&%!)8-5)+5?)

><A:1%??-*(5*+5

241%'0%!!:'<65-))+-66

.:0@!#=/%6)6(,,855

!"#$%&"'()*"++",-

./((0#0/1%2$"01%*(013)4%(0#&-

./((0#0/1%2$"01%*(013)4%(0#&-

./((0#0/1%2$"01%*(013)4%(0#&-

8

,

)

6

(

?

+

,866

,86)

Figure 4.2 Dictionary data structure based on a hash table with 210 = 1024 entries (data extractedfrom schema-independent index for TREC45). Terms with the same hash value are arranged in a linkedlist (chaining). Each term descriptor contains the term itself, the position of the term’s postings list,and a pointer to the next entry in the linked list.

in GOV2 is 9.2 bytes. Storing each term in a fixed-size memory region of 20 bytes wastes 10.8bytes per term on average (internal fragmentation).

One way to eliminate the internal fragmentation is to not store the index terms themselves inthe array, but only pointers to them. For example, the search engine could maintain a primarydictionary array, containing 32-bit pointers into a secondary array. The secondary array thencontains the actual dictionary entries, consisting of the terms themselves and the correspondingpointers into the postings file. This way of organizing the search engine’s dictionary data isshown in Figure 4.3. It is sometimes referred to as the dictionary-as-a-string approach, becausethere are no explicit delimiters between two consecutive dictionary entries; the secondary arraycan be thought of as a long, uninterrupted string.

For the GOV2 collection, the dictionary-as-a-string approach, compared to the dictionarylayout shown in Figure 4.1, reduces the dictionary’s storage requirements by 10.8 − 4 = 6.8bytes per entry. Here the term 4 stems from the pointer overhead in the primary array; theterm 10.8 corresponds to the complete elimination of any internal fragmentation.

It is worth pointing out that the term strings stored in the secondary array do not require anexplicit termination symbol (e.g., the “\0” character), because the length of each term in thedictionary is implicitly given by the pointers in the primary array. For example, by looking atthe pointers for “shakespeare” and “shakespearean” in Figure 4.3, we know that the dictionaryentry for “shakespeare” requires 16629970−16629951 = 19 bytes in total: 11 bytes for the termplus 8 bytes for the 64-bit file pointer into the postings file.

Avishek Anand

✤ Constant lookups based on a Hash table

✤ Entire Lexicon loaded to the memory

Hash-Based Lexicon

6


!"#$%!&%#'%(())*+,*-

'%./%0123%4%!!(,(5*6)6,

##'73#'$5,+8*)*-

9%1":;<=%4>:20)),5,?+,+

&'%0:9=@!12:4)*-,,5+6*

/#47!0#&%!)8-5)+5?)

><A:1%??-*(5*+5

241%'0%!!:'<65-))+-66

.:0@!#=/%6)6(,,855

!"#$%&"'()*"++",-

./((0#0/1%2$"01%*(013)4%(0#&-

./((0#0/1%2$"01%*(013)4%(0#&-

./((0#0/1%2$"01%*(013)4%(0#&-

8

,

)

6

(

?

+

,866

,86)






Avishek Anand

✤ Constant lookups based on a Hash table

✤ Entire Lexicon loaded to the memory

Hash-Based Lexicon

6


!"#$%!&%#'%(())*+,*-

'%./%0123%4%!!(,(5*6)6,

##'73#'$5,+8*)*-

9%1":;<=%4>:20)),5,?+,+

&'%0:9=@!12:4)*-,,5+6*

/#47!0#&%!)8-5)+5?)

><A:1%??-*(5*+5

241%'0%!!:'<65-))+-66

.:0@!#=/%6)6(,,855

!"#$%&"'()*"++",-

./((0#0/1%2$"01%*(013)4%(0#&-

./((0#0/1%2$"01%*(013)4%(0#&-

./((0#0/1%2$"01%*(013)4%(0#&-

8

,

)

6

(

?

+

,866

,86)






✤ Updates difficult

✤ Range Searches, Matching, Substring queries not supported

Avishek Anand

✤ B+-Tree: Leaf nodes additionally linked for efficient range search

✤ Supports lookups in O(log n) and range searches in O(log n + k)

✤ Vocabulary dynamics (i.e., new or removed terms) no problem

✤ Works on secondary storage

B+-Tree or Sort-based Lexicon

7

[aardvark, tid:3, df:3, …]

[a-i][j-z]

[j-k][l-q][r-z][a-d][e-f][g-i]

[a-b][c][d] [e][f] [g][h][i] … … …

m = 3

[aalborg, tid:7, df:2, …]

Avishek Anand

• Mapping of doc-ids to term-ids in the same order

• Efficient retrieval of terms from (already parsed) text

• snippet generation

• proximity features for proximity-aware ranking

• per-doc term distribution for query expansions

Forward Index

8

1: “what does the fox say ?”

124 53 1 49935 1001:

Avishek Anand

✤ Inverted index is a collection of posting lists

✤ Posting contains document identifiers (as integers) along with scores (integers or doubles) and possibly positions (as integers)

✤ Postings list can be organised according to

✤ document identifiers - document ordering

✤ scores - Impact ordering

✤ What are the merits of these orderings ?

Inverted Index

9

information

(d1 , 2, <2, 15>)

(d4 , 5, <2, 15, 23, >)

(d34 , 3, <2, 15, 23>)

…….

Avishek Anand

✤ Based on faster intersections

✤ High compression of index using gap encoding of dids

✤ Easily updatable

Index Organisation

10

✤ Based on processing Top-k results fast

✤ Low compression ratio

✤ Difficult to update

Document Ordering Score/Impact Ordering

Index organisation depends on query processing style.

Avishek Anand

✤ We are given a set of documents D, where each document d is considered as a bag of terms

✤ Inverted Lists are created by a process termed as Inversion

✤ Memory-based Inversion

✤ Takes place entirely in-memory

✤ For small collections, where the index + lexicon fits in memory

✤ Disk-based Inversion

✤ Sort-based inversion vs Merge-based inversion

Inverted Index Construction

11

Avishek Anand

✤ A dictionary is required that allows efficient single-term lookup and insertion operations

✤ An extensible (i.e., dynamic) list data structure is needed that is used to store the postings for each

Memory-based Inversion

12

1: “what does the fox say ?”

2: “the fox jumped over the fence”

[the, <3>] [fox, <4>]

[the, <1,5>] [fox, <2>]

dictionary1:[the, <1,5>] “the”: [1, <3>] [2, <1,5>]

[term, positions]

…. ….

…. ….

doc: [term, positions] [term, posting list]

Avishek Anand

✤ Input Collection D >> memory size M

✤ Inversion can be seen as a sort operation on the term identifiers

✤ This method is based on external sort over data which does not fit into the memory

✤ Read data of size M into memory, sort them and write back to disk

✤ Multiway merge of D/M sorted lists to create index

✤ Shortcomings

✤ Dictionary might not fit in-memory

✤ Large memory requirements due to intermediate data

Sort-based Inversion

13

Avishek Anand

✤ What is the estimated cost of sort-based Inversion in terms of N,M and c ?

✤ How does the cost compare with in-memory sort-based inversion (assuming we had enough memory or N > M) ?

Exercise 1: Analysis of Sort-based Inversion

14

Total number of postings = N

Number of postings which fit in memory = M

Cost of disk read/write of a posting = c

Simple Computational Model

Avishek Anand

✤ Generalisation of in-memory indexing

✤ Reads input collection to create an in-memory index of size M and write it to disk to create partial indexes with local lexicons

✤ Compression in posting lists in partial indexes

✤ Multiway Merge of corresponding lists from the partial indexes to create one consolidated index

Merge-based Inversion

15

partial indexes of size M

Avishek Anand

✤ Programming paradigm for distributed data processing

✤ Improves overall throughput by parallelising loading of data

✤ Data is partitioned into the nodes which process the data in the following phases

✤ Map : Generates (key, value) pairs

✤ Shuffle : Shuffles the pairs over the network to the reducers

✤ Reduce : operates on all values for the same key are

Map-Reduce crash course

16

Avishek Anand

Map-Reduce Example : Word Count

17

1: “what does the fox say ?” 2: “the fox jumped over the fence”

what : 1does : 1

the : 1

fox : 1

say : 1

Mapper - 1 jumped : 1over : 1

the : 2

fox : 1

fence : 1

Mapper - 2

Shuffle + Sort

Reducer - 1

what : 1does : 1 the : 1fox : 1 say : 1jumped : 1 over : 1

the : 2fox : 1

fence : 1

Reducer - 2

mappers emit <word, freq>

reducers aggr. freq.

+ +

Avishek Anand

✤ How would you build the inverted index using Map-reduce ?

✤ What are the key-value pairs as defined by the Mapper ?

✤ What does the reducer do with the values of the same key ?

Exercise 2: Index Construction using Map-Reduce

18

Avishek Anand

✤ Temporal collections have temporal information

✤ Publication times - News Articles

✤ Valid times - Wikipedia articles, Web archive versions

✤ Temporal references - time mentions in text

✤ Time-travel Queries : Retrieve all documents relevant to the text and time

✤ Point in time queries

✤ Time-interval queries

Temporal Collections and Queries

19

game of thrones @ 2011 - 2014

house of cards @ 02/03/2013

Avishek Anand

✤ Given a versioned collection of documents with valid time intervals

✤ and a time-travel text query:

✤ We want to retrieve documents containing terms “game” ,“thrones” and valid between 2011 - 2014

Temporal Indexing

20

How do we efficiently retrieve these documents ?

game of thrones @ 2011 - 2014

Avishek Anand

Time-Travel Index

21

intervention

Lexicon• Inverted index processes keyword

queries

• Intersection of posting lists for processing queries

• Versions have valid time intervals

• Augment postings with valid time intervals

• Post filtering after standard query processing

nato

d5

d1

d2

d3

d4

d1

d4

d8

Avishek Anand

Time-Travel Index

21

intervention

Lexicon• Inverted index processes keyword

queries





nato

d5

d1

d2

d3

d4

d1d2

d3d4

d5

timet1 t91 1

Avishek Anand

Time-Travel Index

21

intervention

Lexicon

(d5, [t5, t7))

(d1, [t1, t9))

(d2, [t2, t8))

(d3, [t3, t6))

(d4, [t4, t6))

• Inverted index processes keyword queries





nato

d1d2

d3d4

d5

timet1 t91 1

Avishek Anand

Time-Travel Index

21

intervention

Lexicon

(d5, [t5, t7))

(d1, [t1, t9))

(d2, [t2, t8))

(d3, [t3, t6))

(d4, [t4, t6))

• Inverted index processes keyword queries





overlaps

overlaps

overlaps

no overlap

no overlap

nato

d1d2

d3d4

d5

timet1 t91 1

Avishek Anand

Challenges in Indexing

22

• If documents are points in time how would you organize the index ?

• What is the problem if the documents are associated with time intervals ?

• Query processing expensive due to wasted accesses

intervention

Lexicon

(d5, [t5, t7))

(d1, [t1, t9))

(d2, [t2, t8))

(d3, [t3, t6))

(d4, [t4, t6))

overlaps

overlaps

overlaps

no overlap

no overlap

nato

d1

d4

d8

Avishek Anand

Data and Query Model

23

temporal distribution

• Each interval represents a document in the posting list

• Data Model: Each document is associated with a time interval

• Query Model: Queries are associated with a time interval [tb, te]

• point in time queries : when begin time = end time

• time interval queries

(d5, [t5, t7))

(d1, [t1, t9))

(d2, [t2, t8))

(d3, [t3, t6))

(d4, [t4, t6))

postings list query interval

Avishek Anand

Challenges in Indexing Time

24

• We would want to avoid unwanted or wasted access to posting lists

• Typically only access those postings that are relevant or a few more (bounded loss)

• Dealing with time points easy, akin to range queries (sorting acc to begin time and range search)

(d5, [t5, t7))

(d1, [t1, t9))

(d2, [t2, t8))

(d3, [t3, t6))

(d4, [t4, t6))

..

..

..

posting lists long

Avishek Anand

Challenges in Indexing Time

24

• We would want to avoid unwanted or wasted access to posting lists

• Typically only access those postings that are relevant or a few more (bounded loss)

• Dealing with time points easy, akin to range queries (sorting acc to begin time and range search)

(d5, [t5, t7))

(d1, [t1, t9))

(d2, [t2, t8))

(d3, [t3, t6))

(d4, [t4, t6))

..

..

..

posting lists long

query interval

d1

d2d3

d4d5

Avishek Anand

Index List Partitioning

25

• Vertically Partition the temporal space and each partition

• Now multiple posting lists per term, each with a valid time interval

• Limits index access, introduces replication

Time-travel queries

t2t1 t4 t7 t9 t11 t13 t16

( d1, [t1, t2) )

( d2, [t4, t9) )

( d3, [t7, t13) )

( d4, [t9, t11) )

McCain

( d5, [t11, t16) )

( d3, [t7, t13) )

( d2, [t4, t9) )

McCain[t3,t8)

( d3, [t7, t13) )

( d2, [t4, t9) )

McCain[t8,t12)

( d5, [t11, t16) )

( d4, [t9, t11) )

Index Partitioning

d1d2

d3

d4d5

McCainMcCain

Each version is a new entry, hence

long list:need partitioning

Avishek Anand

Vertical Partitioning - Query Processing

26

• Dictionary or Lexicon should contain partitioning information

• For each temporal query, select a subset of affected partitions and only read them

• Filter postings which do not overlap with query time interval

time interval query

“hannover”

term partition offset

hannover [t1 - t5) 12646



hannover [t25 - t43) 15324

Avishek Anand

Optimal Approaches

27

• Performance Optimal Approach

• Keeps one posting list per every elementary time interval,

• Achieves optimal performance but large space overhead

• Space Optimal Approach

• No replication of postings, no blowup, sub-optimal performance

3.7 Partitioning Strategies

3.7.1 Performance-Optimal Approach

As discussed in Section 3.5, the performance when processing a time-point queryq t on a TTIX instance is influenced adversely by the wasted I/O due to read butfiltered-out postings. Temporal coalescing implicitly addresses this problem byreducing the number of postings and the space consumption of the posting listsscanned, but still a significant overhead remains. We now tackle this problemand describe temporal partitioning strategies geared at time-point queries thatdetermine for each term v a set of posting lists that should be kept in the index.

timet1 t2 t3 t4 t5 t6 t7 t8 t9 t10

d1

d2

d3

document

1 2 3 4

5 6 7

8 9 10

Figure 3.6: Partitioning illustrated

We illustrate the trade-offs of temporal partitioning for time-point queries us-ing the example given in Figure 3.6. The figure shows a total of ten postingsbelonging to term v and three different documents d1, d2, and d3. For ease ofdescription, we have numbered boundaries of valid-time intervals in increas-ing time-order, as t1, . . . , t10 and numbered the postings themselves as 1, . . . , 10.Now, consider that we want to process a time-point query with t 2 [t1, t2). Onlythree postings (namely, 1, 5, and 8) are valid at time t and therefore required toprocess the query. In the worst case, if only a single list Lv : [t1, t10) is con-tained in our index, we have to read all ten postings. In the best case, if the listLv : [t1, t2) is contained in the index, we achieve the optimal query-processingperformance, reading only the three postings required to answer the query.

This last observation suggests one strategy to eliminate the problem of filtered-out postings entirely. By choosing

Pv = Ev (3.31)

and thus keeping a posting list for every elementary time interval, for any querytime-point t only the postings valid at that time are read, so that the optimal

75

Avishek Anand

Partitioning Strategies

28

• Given an input sequence of intervals how do we partition them into sublists ?

• Space Bound Materialization Approach : We have a limited budget for space, need to maximize our performance

• Performance Guarantee Approach: For any query we need a guarantee on the performance loss, need to minimize blowup

Trade-off size and performance

Avishek Anand

Partitioning Strategies

28

• Given an input sequence of intervals how do we partition them into sublists ?

• Space Bound Materialization Approach : We have a limited budget for space, need to maximize our performance

• Performance Guarantee Approach: For any query we need a guarantee on the performance loss, need to minimize blowup

Trade-off size and performance

Avishek Anand

Space Bound Approach

29

• Minimize expected number of postings read for a time-point query, while ensuring that the index contains at most κ times the optimal number of postings

• Optimal solution computable in O( |S| × n2 ) time and O( |S| × n ) space using dynamic programming over prefix subproblems [ t1, tk ) and space bounds s ≤ κ· |Lv|

Space budget = 1/3 . (optimal number of postings)

Avishek Anand

Space Bound Approach

29

• Minimize expected number of postings read for a time-point query, while ensuring that the index contains at most κ times the optimal number of postings

• Optimal solution computable in O( |S| × n2 ) time and O( |S| × n ) space using dynamic programming over prefix subproblems [ t1, tk ) and space bounds s ≤ κ· |Lv|

Space budget = 1/3 . (optimal number of postings)

Avishek Anand

Performance Guarantee

30

• Minimize total number of postings kept in the index, while guaranteeing that for any time-point query the number of postings read is at most a factor γ worse than optimal

• Optimal solution computable in time O( |Lv| + n2 ) and space O( n2 ) using dynamic programming over prefix subproblems [ t1, tk )

performance guarantee = γ (times the number of optimal results at that time)

Avishek Anand

Performance Guarantee

30

• Minimize total number of postings kept in the index, while guaranteeing that for any time-point query the number of postings read is at most a factor γ worse than optimal

• Optimal solution computable in time O( |Lv| + n2 ) and space O( n2 ) using dynamic programming over prefix subproblems [ t1, tk )

performance guarantee = γ (times the number of optimal results at that time)

Avishek Anand

Exercise

31

• Performance Guarantee with γ = 2 (read at max twice the number of postings than optimal)

• Space bound approach with κ = 1.33 (1/3 more than overall space)

d1

d2d3

d4d5

d6

Avishek Anand

Horizontal Partitioning - Sharding

32

p1 p3p2 p4 p6p5

Can we partition a posting list without replicating postings ?

• Index size blowup due to replication of postings across slices

• Query processing inefficient if replicated postings are accessed multiple times

Avishek Anand


32

Relevant postings = 3 Postings read : 3

short time interval query

p1 p3p2 p4 p6p5




Avishek Anand


32

Relevant postings = 3 Postings read : 3

short time interval query

Relevant postings = 7 Postings read : 2 + 4 + 4 + 3 = 13

long time interval query

p1 p3p2 p4 p6p5




Avishek Anand

Index Sharding

33

• Partition documents in each posting list into sublists called shards

• Contents of each shard disjoint - no replication, no index blowup

• Postings stored in begin time order

• Access structure over each shard for efficient query processing

e1

e2e3

e4e5

e2

e1

e3e4

e5

Slicing

Shardingtime

time

doc.id

doc.id

Avishek Anand

Index Sharding

34

• All shards for a given query term are accessed

• Open-skip-scan on each shard assisted by impact lists

• Result list constructed by merging results from each shard

beijing olympics @ [8 Aug 2008, 24 Aug 2008]

Beijing Olympics

Avishek Anand

Index Sharding - Impact Lists

35

• Open - Each shard of a query term opened for access

• Skip - Given a query begin time seek to appropriate offset

• Scan - Read while postings still have overlap with query time interval

read until begin time < te

Impact list

1 2 4

t1 t12 t52 t100 t115 t150

1

2

3

4

5

5

Avishek Anand


35




seek to shard offset for tb


Impact list

1 2 4

t1 t12 t52 t100 t115 t150

1

2

3

4

5

5

Avishek Anand


35






Impact list

1 2 4

t1 t12 t52 t100 t115 t150

1

2

3

4

5

5

Avishek Anand


35






Impact list

1 2 4

t1 t12 t52 t100 t115 t150

1

2

3

4

5

5

Avishek Anand


35






Impact list

1 2 4

t1 t12 t52 t100 t115 t150

1

2

3

4

5

5

Avishek Anand


35






Impact list

1 2 4

t1 t12 t52 t100 t115 t150

1

2

3

4

5

5

Avishek Anand

Index Sharding - Staircase Property

36

• Wasted reads are processed but do not overlap with the query time interval

• Staircase property in a shard

• Intervals arranged in begin time order

• No interval completely subsumes another interval

• Eliminates wasted reads

wasted reads due to subsumption

Avishek Anand


36






query begin time


Avishek Anand


36






wasted readwasted read

query begin time


Avishek Anand


36







query begin time


staircase property

Avishek Anand

Index Sharding - Idealized Sharding

37

• Staircase property eliminates sequential accesses of postings non-overlapping with query time interval

• Minimizing number of shards is essential in minimizing number of random accesses

• Input : Set of postings/intervals corresponding to a postings list

• Problem Statement : Minimize the number of shards where each shard exhibits the staircase property

Greedy Algorithm exists which is proven to be optimal

Avishek Anand


38

Input :

• Postings arrive in begin time order

• For each posting chose a shard which

• does not violate staircase prop.

• has min. end time difference

• Runtime complexity O(n log n)

append posting to the end of chosen shard

p1

p2

p3

p4

p5

Avishek Anand


38

Input :

Shard 1







p1

p2

p3

p4

p5

Avishek Anand


38

Input :

Shard 1







p1

p2

p3

p4

p5

p1

Avishek Anand


38

Input :

Shard 1





• Runtime complexity O(n log n)Shard 2


p1

p2

p3

p4

p5

p1

p2

Avishek Anand


38

Input :

Shard 1







p1

p2

p3

p4

p5

p1

p2

p3

Avishek Anand


38

Input :

Shard 1







p1

p2

p3

p4

p5

p1

p2

p3

Avishek Anand


38

Input :

Shard 1







p1

p2

p3

p4

p5

p1

p2

p3

p4

Avishek Anand


38

Input :

Shard 1







p1

p2

p3

p4

p5

p1

p2

p3

p4

Avishek Anand


38

Input :

Shard 1







p1

p2

p3

p4

p5

p1

p2

p3

p4

p5

Avishek Anand


38

Input :

Shard 1







p1

p2

p3

p4

p5

p1

p2

p3

p4

p5

Avishek Anand

Exercise

39

• What is the Idealized sharding for the given input ?

• What are Impact lists for each idealized shard ?

• What is the worst case (in the number of shards) example for idealised sharding ?

d1

d2d3

d4d5

d6

Avishek Anand

Index Sharding - Challenges

40

• No wasted reads

• Many shards

• QP suffers. Why ?

Idealized Sharding

• Random accesses (RA) are typically much more expensive than sequential accesses (SA)

Avishek Anand


40

• No wasted reads

• Many shards

• QP suffers. Why ?

Idealized Sharding


More shards the more the random accesses to disk

• Allow wasted reads to balance SA and RA

Avishek Anand


41

Idealized Sharding


Relaxing the Sharding

Avishek Anand


41

Idealized Sharding


More shards the more the random accesses to disk

• Allow wasted reads to balance SA and RA

Relaxing the Sharding

Avishek Anand

Bounded Subsumption

42

• Balancing sequential and random accesses

• Bounded subsumption: no more than wasted reads for any query begin time


query begin time

• Bounded Subsumption Problem: Minimize number of shards s.t. each shard has bounded subsumption

⌘

Avishek Anand

Bounded Subsumption

42




query begin time

• Bounded Subsumption Problem: Minimize number of shards s.t. each shard has bounded subsumption

wasted read

⌘

Avishek Anand

Bounded Subsumption

42




query begin time

wasted read

Can we create shards solving the bounded subsumption problem?

⌘

Avishek Anand

Incremental Sharding

43

• Algorithm assigns incoming posting to a shard

• Posting inserted into shard buffer maintaining begin time order

• Top posting popped and appended to the shard end

t1

t2

t3

t4

time now

shard 1

shard 2

shard 3

Avishek Anand


43




t1

t2

t3

t4

nowtime

shard 1

shard 2

shard 3

Avishek Anand


43




t1

t2

t3

t4

nowtime

shard 1

shard 2

shard 3

Avishek Anand


43




t1

t2

t3

t4

nowtime

shard 1

shard 2

shard 3

Avishek Anand


43




t1

t2

t3

t4

nowtime

shard 1

shard 2

shard 3

Avishek Anand

Putting Things Together

44

• Start with initial posting list and partition them vertically or horizontally

• Given a query time interval for each term

• Access either subset of entire lists or part of all lists

• Intersect or Union the results for all query terms

Vertical Partitions Horizontal PartitionsOriginal

Postings List

Avishek Anand

Open Source Full-text Indexing Software

Avishek Anand

Task 1

46

• Given the Temporalia collection

• Document collections ~ 50k documents

• Query workload

• Index the doc collection both

• Text

• Time - publication dates + temporal references

• Should support temporal queries “earthquakes” @ 1998

• Result documents should have documents either published in 1998 or contain references which overlap with 1998

Avishek Anand

Task 1 - Doc Collection

47

Avishek Anand

Task 1 - Queries

48

• Expected finish date: Before the next task is distributed (2 weeks)

• Evaluate the results yourself and meet us after the lecture

Avishek Anand

✤ Informa(on retrieval: (h2p://www.ir.uwaterloo.ca/book/) Stefan Bü2cher, Google Inc. , Charles L. A. Clarke, Univ. of Waterloo, Gordon V. Cormack, Univ. of Waterloo

✤Managing Gigabytes: by Jus(n Zobel, Alistair Moffat, Ian Wi2en

✤ Indexing Methods for Web Archives: Avishek Anand

References

49

http://www.ir.uwaterloo.ca/book/

http://stefan.buettcher.org/

http://plg.uwaterloo.ca/~claclark/

http://plg.uwaterloo.ca/~gvcormac/

Documents

Indexing - l3s.deanand/tir15/lectures/ws15-tir-indexing.pdfTemporal Information Retrieval Index Organization, Construction, Temporal Query ... Each term descriptor contains the term