Fast Phrase Querying With Combined Indexes HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE RMIT University 2004 Burak Görener 201195001 Doğuş University

Fast Phrase Querying With Combined Indexes

HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE

RMIT University

2004

Burak Görener 201195001

Doğuş University

Search Engines . . .

Need to evaluate queries extremely fast.

Many queries involve phrases.

Phrase queries should be supported with low disk overheads.

Introduction

Most queries consist of a simple list of words.

In some queries the terms must appear in order and adjacent to each other (a phrase query).

A phrase is typically indicated by enclosing it in quotation marks. The standard way to evaluate phrase queries is to use an inverted index.

An inverted index (II) uses a list of postings (each posting includes a document ID)

and a list of offsets (ordinal word positions). The II evaluates a phrase by combining the postings lists of the query

terms to find the documents in which the terms occur as a phrase. This process is usually fast, but not always!

The reason is common words.
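
A minimal sketch of the evaluation described above, assuming a toy in-memory index. The data in `INDEX` and the name `phrase_query` are illustrative assumptions and are not taken from the paper; offsets are zero-based word positions.

```python
from typing import Dict, List

# Toy positional inverted index (hypothetical data):
# term -> {document ID -> sorted list of ordinal word positions}
Index = Dict[str, Dict[int, List[int]]]

INDEX: Index = {
    "tower":  {1: [0, 7], 2: [4]},
    "of":     {1: [1, 5], 2: [2, 5], 3: [0]},
    "london": {1: [2], 2: [9], 3: [1]},
}


def phrase_query(index: Index, phrase: str) -> Dict[int, List[int]]:
    """Return {doc ID: start positions} where the words occur adjacent and in order."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return {}
    # Start from the first word's postings, then narrow down word by word.
    matches = {doc: list(pos) for doc, pos in index[words[0]].items()}
    for k, word in enumerate(words[1:], start=1):
        postings = index[word]
        narrowed = {}
        for doc, starts in matches.items():
            offsets = set(postings.get(doc, ()))
            # The k-th word must sit exactly k positions after the phrase start.
            kept = [s for s in starts if s + k in offsets]
            if kept:
                narrowed[doc] = kept
        matches = narrowed
    return matches


print(phrase_query(INDEX, "tower of london"))  # {1: [0]}
```

In a real collection the postings list of a common word such as of is by far the longest one touched, which is why common words dominate the cost.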

Introduction Cont.

A common term can require several megabytes of postings for each gigabyte of indexed data.

A crude solution is stopping (ignoring common words). Google neglected common words in phrase queries until

2002.

Until then, many queries were evaluated incorrectly.

Introduction Cont.

A nextword index is similar to an inverted index.

Its index terms are word pairs (a firstword and a nextword).

For each index term (firstword) there is a list of the words (nextwords) that

follow that term. A firstword and a nextword occur as a pair.

Its disadvantage is its storage size. The nextword list must also be processed linearly.

With direct indexing of phrases, indexing the 10,000 most common phrase queries reduces query evaluation time by over 10%.

Next . . .

Introduction (Fin)

Properties of Phrase Queries

Inverted Index in Phrase Queries

Partial Phrase and Nextword Indexing

Combining Phrase and Inverted Indexing

Experimental Result

Conclusion

Properties of Queries

This research used query logs from Excite, dating from 1997 and 1999.

The logs have similar properties. There are 1,583,922 queries, including duplicates. 8.3% of these were explicit phrase queries. Overall, 5-10% of queries are explicit phrase queries.

The queries were matched against a Web dataset of around 20 GB. Of the phrase queries, 11,103 (8.4%) include one of the three

common words the, to, and of. In total, 14.4% of phrase queries include one of the 20 most common

terms.
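
The figures above can be cross-checked with a few lines of arithmetic. This is only a consistency check, assuming that the 8.3% share refers to the 1,583,922 logged queries; the variable names are illustrative.

```python
total_queries = 1_583_922   # all logged queries, duplicates included
phrase_share = 0.083        # share of explicit phrase queries
with_top3 = 11_103          # phrase queries containing the, to, or of

# Estimated number of explicit phrase queries in the logs.
phrase_queries = total_queries * phrase_share
# Share of phrase queries that contain one of the three common words.
share_top3 = with_top3 / phrase_queries

print(round(phrase_queries))   # 131466
print(f"{share_top3:.1%}")     # 8.4%
```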

Properties of Queries

Common words play an important role!

In tower of london, the word of can be safely neglected during evaluation.

But not in special names such as movie or band names: End of days or The who.

These queries are difficult to evaluate with stopwords removed.

The query logs also include: to be or not to be; who are we; all in all.

Properties of Queries

Stopping may yield efficiency gains,

but a significant number of queries cannot be evaluated correctly.

With stopping, the query tower of london is evaluated as tower – london, that is, tower and london separated by any single word.
Stopping the 3 commonest words: 309 x 10^6 matches
Stopping the 20 commonest words: 490 x 10^6 matches
Stopping the 254 commonest words: 1693 x 10^6 matches

The most frequent mix-ups involve from and to.

Mismatch: flights from london and flights to london.
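
A small sketch of why stopping causes such mismatches: once a stopword is reduced to a gap, queries that differ only in stopwords become indistinguishable. The stoplist below is a hypothetical one chosen for illustration; the slides do not list the actual stopwords.

```python
from typing import List, Optional

# Hypothetical stoplist; not the one used in the experiments.
STOPWORDS = {"the", "to", "of", "from", "in", "on", "so", "how"}


def stop(query: str) -> List[Optional[str]]:
    """Replace each stopword by None, keeping its position as a gap."""
    return [None if w in STOPWORDS else w for w in query.lower().split()]


def same_after_stopping(a: str, b: str) -> bool:
    """True if the two queries cannot be told apart once stopwords are removed."""
    return stop(a) == stop(b)


print(stop("tower of london"))  # ['tower', None, 'london']
print(same_after_stopping("flights from london", "flights to london"))  # True
```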

Properties of Queries

Other mismatch examples: so many roads -> how many roads; man in the moon -> man on the moon.

Length of the phrase queries:

Generally 2 words. 34% have 3 words. 1.3% have 6 or more words.

Properties of Queries

Test Data

The test data is the WT10g collection. It contains 10.27 GB of Web data (HTML) in 1.67 million documents. It was crawled in 1997.

Most Frequent Words and Word Pairs

Next . . .

Introduction (Fin)

Properties of Phrase Queries (Fin)

Inverted Index in Phrase Queries

Partial Phrase and Nextword Indexing

Combining Phrase and Inverted Indexing

Experimental Result

Conclusion

Inverted Index

It is the standard method for supporting queries on large text databases.

It is fast for ranked query evaluation.

It uses a two-level structure.

The upper level is a vocabulary (lexicon). The lower level is a set of postings lists.

Notation of Zobel and Moffat (1998):

d is the document ID; f_dt is the frequency of term t in document d; o_x is a position of the term in document d.
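
The two-level structure and the posting notation can be sketched as follows. The function name, the two-document collection, and the use of zero-based offsets are assumptions made for illustration; a production index would also compress the postings.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[int, int, List[int]]   # (d, f_dt, [o_1, ..., o_f])


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[Posting]]:
    """Map each vocabulary term to its postings list, ordered by document ID."""
    positions: Dict[str, Dict[int, List[int]]] = defaultdict(dict)
    for d in sorted(docs):
        for offset, term in enumerate(docs[d].lower().split()):
            positions[term].setdefault(d, []).append(offset)
    # f_dt is simply the number of offsets recorded for the term in document d.
    return {
        term: [(d, len(offs), offs) for d, offs in by_doc.items()]
        for term, by_doc in positions.items()
    }


# Hypothetical two-document collection.
docs = {1: "hatful of hollow", 2: "a hatful of rain and a hatful of sun"}
index = build_inverted_index(docs)
print(index["hatful"])   # [(1, 1, [0]), (2, 2, [1, 6])]
```

The number of postings in a term's list is its document frequency, so `len(index["of"])` is 2 here.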

Inverted Index

Consider the phrase "hatful of hollow".

This is the general structure of an inverted index.

It contains term and document frequencies.

Word positions are ordinal.

Inverted Index

Inverted Index Evaluator

It is the open-source MG text retrieval engine described by Witten et al. (1999).

The inverted index for WT10g is 1,429 MB in size.

The stopped words account for 427 MB (490 stopwords). The stopped inverted index is 1,002 MB in size.

Inverted Index

Results of inverted index evaluation

Next . . .

Introduction (Fin)

Properties of Phrase Queries (Fin)

Inverted Index in Phrase Queries (Fin)

Partial Phrase and Nextword Indexing

Combining Phrase and Inverted Indexing

Experimental Result

Conclusion

Phrase Indexes

A phrase index is an inverted index in which each indexed item is a word sequence.

Example: a partial phrase index with a vocabulary of five popular phrases.

Phrase Indexes

A phrase index for phrases of length L = 3 cannot be used efficiently for 2-word queries.

Phrases with L > 2 are stored as terms in a conventional inverted index. Phrases with L = 2 are organized as partial nextword indexes.

Partial Phrase Index

Its notation:

d is the document ID; f_dp is the frequency of phrase p in document d. Offsets are not stored. This saves the cost of merging lists.
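
A minimal sketch of a partial phrase index under this notation: each indexed phrase maps directly to (d, f_dp) pairs with no offsets, so an indexed query is answered by a single lookup. The entries and the name `lookup_phrase` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# phrase -> [(d, f_dp)]: document ID and phrase frequency, no offsets.
# The entries below are hypothetical.
PHRASE_INDEX: Dict[str, List[Tuple[int, int]]] = {
    "lord of the rings": [(3, 2), (17, 1)],
    "new york": [(5, 4)],
}


def lookup_phrase(query: str) -> Optional[List[Tuple[int, int]]]:
    """Return the stored postings for an indexed phrase, or None if not indexed."""
    key = " ".join(query.lower().split())   # normalize case and spacing
    return PHRASE_INDEX.get(key)


print(lookup_phrase("Lord of the Rings"))   # [(3, 2), (17, 1)]
print(lookup_phrase("tower of london"))     # None
```

A `None` result means the query has to be evaluated through one of the other indexes.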

Phrase Indexes

Examples:

lord of the rings (19) and britney spears (59)* in 2001. Given a stream of queries over a long period and a fixed volume

of memory,

it may also be necessary to update the vocabulary, replacing the least frequently used queries.

This research does not experiment with this approach.

* The number of times the same query was requested.
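
Choosing the vocabulary of a partial phrase index from a query log can be sketched with a frequency count. The log fragment and the name `top_phrases` are assumptions; the dynamic replacement of least frequently used queries mentioned above is not modelled.

```python
from collections import Counter
from typing import Iterable, List


def top_phrases(query_log: Iterable[str], k: int) -> List[str]:
    """Return the k most frequent distinct phrase queries in the log."""
    counts = Counter(" ".join(q.lower().split()) for q in query_log)
    return [phrase for phrase, _ in counts.most_common(k)]


# Hypothetical log fragment.
log = ["britney spears", "lord of the rings", "britney spears",
       "tower of london", "Britney  Spears", "lord of the rings"]
print(top_phrases(log, 2))   # ['britney spears', 'lord of the rings']
```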

Nextword Indexes

A phrase query is never shorter than two words.

A nextword index is similar to an inverted index.

Term representation:

f_wp is the document frequency of the word pair; d is the document ID; f_dwp is the frequency of the pair in document d; o_x is a position of the pair in d.
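
A sketch of the structure, assuming nested dictionaries in place of the on-disk layout: each firstword leads to its nextwords, and each pair has its own postings. The names and the two sample documents are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List

# firstword -> nextword -> {document ID -> positions of the firstword}
NextwordIndex = Dict[str, Dict[str, Dict[int, List[int]]]]


def build_nextword_index(docs: Dict[int, str]) -> NextwordIndex:
    """Index every adjacent word pair of every document."""
    index: NextwordIndex = defaultdict(lambda: defaultdict(dict))
    for d in sorted(docs):
        words = docs[d].lower().split()
        for offset, (first, nxt) in enumerate(zip(words, words[1:])):
            index[first][nxt].setdefault(d, []).append(offset)
    return index


def pair_postings(index: NextwordIndex, first: str, nxt: str) -> Dict[int, List[int]]:
    """Postings for one firstword-nextword pair; empty if the pair never occurs."""
    return dict(index.get(first, {}).get(nxt, {}))


# Hypothetical documents.
docs = {1: "in new hampshire", 2: "railroads in new york in new hampshire"}
idx = build_nextword_index(docs)
print(pair_postings(idx, "in", "new"))   # {1: [0], 2: [1, 4]}
print(sorted(idx["new"]))                # ['hampshire', 'york']
```

A two-word phrase query is answered by one pair lookup, with no merging of lists.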

Nextword Indexes

Example: a nextword index with two firstwords.

An example: boulder municipal employee credit union

This can be grouped as boulder-municipal, employee-credit, and credit-union.

Another example: historical railroads in new hampshire

The pair railroads-in is chosen in preference to in-new, as railroads is much less common than in.
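
One way to reproduce the groupings above is a greedy rule: prefer pairs with the rarest firstwords until every query word is covered. This rule is a reading of the two examples and not necessarily the authors' exact procedure; the frequencies are hypothetical.

```python
from typing import Dict, List, Tuple


def choose_pairs(words: List[str], freq: Dict[str, int]) -> List[Tuple[str, str]]:
    """Pick adjacent pairs with the rarest firstwords until every word is covered."""
    # Candidate pair i is (words[i], words[i + 1]); rarest firstword first.
    candidates = sorted(range(len(words) - 1), key=lambda i: freq[words[i]])
    covered = set()
    chosen = []
    for i in candidates:
        if i in covered and i + 1 in covered:
            continue          # this pair adds nothing new
        chosen.append(i)
        covered.update((i, i + 1))
        if len(covered) == len(words):
            break
    return [(words[i], words[i + 1]) for i in sorted(chosen)]


# Hypothetical collection frequencies of the candidate firstwords.
freq = {"historical": 20, "railroads": 5, "in": 1000, "new": 300}
print(choose_pairs("historical railroads in new hampshire".split(), freq))
# [('historical', 'railroads'), ('railroads', 'in'), ('new', 'hampshire')]
```

The pair in-new is never fetched, so the long list for the common firstword in is avoided.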

Nextword Indexes

The nextword index for the WT10g collection is 2.75 GB in size.

It is about twice the size of the inverted index file (1,429 MB). Processing with the nextword index involves more complex structures than

processing with an inverted index.

Differences between Inverted Index and Nextword Index in queries

Next . . .

Introduction (Fin)

Properties of Phrase Queries (Fin)

Inverted Index in Phrase Queries (Fin)

Partial Phrase and Nextword Indexing (Fin)

Combining Phrase and Inverted Indexing

Experimental Result

Conclusion

Combining Nextword and Inverted Indexing

The proposal: only common words are used as firstwords in a partial nextword index.
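
A sketch of how a query could be split between the two indexes under this proposal: common words that have a successor are looked up as firstwords, and every remaining word falls back to the inverted index. The function name, the tuple layout, and the three-word common set are assumptions for illustration.

```python
from typing import List, Set, Tuple

Lookup = Tuple[str, int, str]   # (index name, position in query, lookup key)


def plan_lookups(words: List[str], common: Set[str]) -> List[Lookup]:
    """Assign each query word to a nextword or an inverted index lookup."""
    plan: List[Lookup] = []
    covered = set()
    # Common words that have a successor are resolved as firstwords.
    for i, w in enumerate(words[:-1]):
        if w in common:
            plan.append(("nextword", i, f"{w} {words[i + 1]}"))
            covered.update((i, i + 1))
    # Everything still uncovered falls back to the inverted index.
    for i, w in enumerate(words):
        if i not in covered:
            plan.append(("inverted", i, w))
    return sorted(plan, key=lambda item: item[1])


COMMON = {"the", "to", "of"}
print(plan_lookups("tower of london".split(), COMMON))
# [('inverted', 0, 'tower'), ('nextword', 1, 'of london')]
```

The long postings list for of is never read; only the much shorter list for the pair of-london is.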

Combining Phrase and Inverted Indexing

As an example, the query new york city

can be resolved by using the partial phrase index to find the locations of new york and merging them with the

inverted index postings list for city.

Three-Way Index Combination

It includes a partial nextword index, a partial phrase index, and a full inverted index.
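
The choice among the three indexes can be sketched as a simple dispatch. As a simplifying assumption, the phrase index is consulted only when the whole query is an indexed phrase; the function name and the returned labels are illustrative.

```python
from typing import List, Set


def choose_strategy(words: List[str], phrase_vocab: Set[str], common: Set[str]) -> str:
    """Name the index combination that would answer the phrase query."""
    if " ".join(words) in phrase_vocab:
        return "phrase"                # whole query is an indexed phrase
    if any(w in common for w in words[:-1]):
        return "nextword+inverted"     # common firstwords via the nextword index
    return "inverted"                  # no common firstword: inverted index alone


vocab = {"lord of the rings"}
common = {"the", "to", "of"}
print(choose_strategy("lord of the rings".split(), vocab, common))  # phrase
print(choose_strategy("tower of london".split(), vocab, common))    # nextword+inverted
print(choose_strategy("boulder municipal".split(), vocab, common))  # inverted
```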

Next . . .

Introduction (Fin)

Properties of Phrase Queries (Fin)

Inverted Index in Phrase Queries (Fin)

Partial Phrase and Nextword Indexing (Fin)

Combining Phrase and Inverted Indexing (Fin)

Experimental Result

Conclusion

Experimental Result

All experiments were run on an Intel 700 MHz Pentium III-based server with 2 GB of memory.

Result of Inverted and Nextword Indexing

This table shows the memory usage of the combinations.

Result of Inverted and Nextword Indexing

Results for n-term queries with Inverted and Nextword Indexing

Result of Inverted Index and Phrase

This test uses the 100, 1,000, and 10,000 most frequent distinct queries.

The phrase index was less than 0.1% of the collection size: 2.1 MB, 4.8 MB, and 12.8 MB, respectively.

In the query logs, an american dictionary of the english language and los angeles department of

water and power are among the 10,000 most common queries. Experimental results:

Result of Inverted Index, Nextword Index and Phrase

This result is based on testing 66,000 queries with a phrase index of the 10,000 most common queries, a nextword index (stopped words only as firstwords), and an inverted index.

Next . . .

Introduction (Fin)

Properties of Phrase Queries (Fin)

Inverted Index in Phrase Queries (Fin)

Partial Phrase and Nextword Indexing (Fin)

Combining Phrase and Inverted Indexing (Fin)

Experimental Result (Fin)

Conclusion

Conclusion