Indexing and Caching in Search Engines

  • 7/31/2019 Indexing and Caching In

    11MX56

    INDEXING AND CACHING IN SEARCH ENGINES

    THE PROBLEM

    Users want results the moment they submit a query.

    What does the engine have to handle?

    A staggering volume of over 4 billion web pages!

    And over a million queries per minute!

    The response time must be as short as possible.

    The engine must consume as few resources as possible.

    SOME SEARCH STATISTICS TO NOTE

    63.7% of the queries are unique.

    Approximately 34%, or only about one third, of the submitted search queries are repeated.

    58% of users view only the first page of the search results (an average across the popular search engines Google, Yahoo and Ask).

    No more than 12% of users browse through more than 3 pages.

    THE SOLUTION

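    [Figure: query flow without a cache and with a cache. Without a cache, every QUERY goes to the web query server, which returns the RESULT. With a cache, the QUERY is first checked for a CACHE HIT: if yes, the RESULT comes from the cache; if no, the QUERY goes on to the web query server.]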
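
The cache-hit decision in the figure can be sketched in a few lines. This is a minimal illustration, not the design of any real engine: `query_web_server` is a hypothetical stand-in for the expensive back-end search, and a plain dictionary plays the role of the cache.

```python
# Hypothetical stand-in for the expensive back-end search (an assumption).
backend_calls = 0

def query_web_server(query):
    global backend_calls
    backend_calls += 1
    return [f"result for {query!r}"]

cache = {}  # query string -> previously computed results

def search(query):
    """Return results, consulting the cache before the query server."""
    if query in cache:                    # cache hit: the "Yes" branch
        return cache[query]
    results = query_web_server(query)     # cache miss: the "No" branch
    cache[query] = results                # remember for the next identical query
    return results

first = search("panther")    # goes to the query server
second = search("panther")   # served from the cache
```

Only the first call reaches the back end; the repeated query never touches it, which is where the saving in response time and resources comes from.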

    WHAT DO WE CACHE/INDEX?

    [Pie chart: searches split into "Same" and "Different" queries.]

    36% of all queries have been retrieved before.

    The stats show that many people are looking for the same thing when using a search engine.

    VARIANTS

    1. Direct Cache

    2. Inverted Index/List

    3. Two-Level

    4. N-Level

    VARIANTS

    Direct Cache

    Stores the top few results of queries that are searched frequently/recently.

    Can be fetched even before the query is processed and tokenized.

    Inverted Index

    Stores links to the pages containing the tokens of all frequently/recently searched queries.

    Can only be fetched after the query is processed and tokenized.
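
The difference between the two variants can be contrasted in a short sketch. Everything here is illustrative: `result_cache`, `posting_cache`, `tokenize` and `lookup` are hypothetical names, and the tokenizer is deliberately crude.

```python
def tokenize(query):
    """Very rough tokenizer (an assumption): lowercase, split on whitespace."""
    return query.lower().split()

# Direct cache: raw query string -> top results. Usable before tokenizing.
result_cache = {"what is it": ["page2", "page1"]}

# Inverted-list cache: token -> pages containing it. Needs the tokens first.
posting_cache = {
    "what": {"page1", "page2"},
    "is": {"page1", "page2", "page3"},
    "it": {"page1", "page2", "page3"},
}

def lookup(query):
    # 1. Direct cache: match on the raw query, no processing required.
    if query in result_cache:
        return "direct", result_cache[query]
    # 2. Inverted-list cache: only possible once the query is tokenized.
    tokens = tokenize(query)
    if tokens and all(t in posting_cache for t in tokens):
        pages = set.intersection(*(posting_cache[t] for t in tokens))
        return "inverted", sorted(pages)
    return "miss", []
```

The direct cache answers only exact repeats of a query, while the cached inverted lists can answer any query made up of cached tokens, at the cost of tokenizing and intersecting first.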

    POLICIES

    LRU (Least Recently Used)

    Allocate a queue that can accommodate a certain number of result pages.

    When the queue is full and a new page needs to be cached, the least RECENTLY used page is removed from the cache.

    LFU (Least Frequently Used)

    Allocate a rank-based list that can accommodate a certain number of result pages.

    When the list is full and a new page needs to be cached, the least FREQUENTLY used page is removed from the cache.
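
Both eviction policies can be sketched compactly. The class names and the `get`/`put` interface are assumptions made for illustration, a capacity of at least 1 is assumed, and the LFU version breaks ties by evicting the oldest of the least-used entries.

```python
from collections import Counter, OrderedDict

class LRUCache:
    """Evicts the least recently used entry when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()            # oldest entry first

    def get(self, query):
        if query not in self.pages:
            return None
        self.pages.move_to_end(query)         # mark as most recently used
        return self.pages[query]

    def put(self, query, page):
        if query in self.pages:
            self.pages.move_to_end(query)
        elif len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)    # drop the least recently used
        self.pages[query] = page

class LFUCache:
    """Evicts the least frequently used entry when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = {}
        self.hits = Counter()                 # query -> number of uses

    def get(self, query):
        if query not in self.pages:
            return None
        self.hits[query] += 1
        return self.pages[query]

    def put(self, query, page):
        if query not in self.pages and len(self.pages) >= self.capacity:
            victim = min(self.pages, key=lambda q: self.hits[q])
            del self.pages[victim]            # drop the least frequently used
            del self.hits[victim]
        self.pages[query] = page
        self.hits[query] += 1                 # storing counts as one use

lru = LRUCache(2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")         # "a" is now more recent than "b"
lru.put("c", 3)      # cache full: "b" is evicted

lfu = LFUCache(2)
lfu.put("a", 1)
lfu.put("b", 2)
lfu.get("a")
lfu.get("a")         # "a" has been used more often than "b"
lfu.put("c", 3)      # cache full: "b" is evicted
```

In this run both policies evict "b", but for different reasons: LRU because "b" was touched longest ago, LFU because "b" has the lowest use count.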

    AN ADVANCED POLICY

    Probability Driven Cache

    o Users search in sessions, so the next query will probably be related to the previous query.

    o This is currently in use by Google, as seen in the related searches shown at the bottom of the results page.
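
One way to read the session idea is to record which query tends to follow which, and then warm the cache for the likeliest follow-up. The sketch below is only an illustration under that reading; it is not Google's mechanism or a published probability-driven caching algorithm, and the names `follow_ups`, `record_session` and `likely_next` are hypothetical.

```python
from collections import Counter, defaultdict

# follow_ups[q] counts which queries were issued right after q in a session.
follow_ups = defaultdict(Counter)

def record_session(queries):
    """Update the follow-up counts from one user's ordered list of queries."""
    for prev, nxt in zip(queries, queries[1:]):
        follow_ups[prev][nxt] += 1

def likely_next(query):
    """Return (next_query, estimated probability), or None if query is unseen."""
    counts = follow_ups.get(query)
    if not counts:
        return None
    nxt, n = counts.most_common(1)[0]
    return nxt, n / sum(counts.values())

# Made-up sessions, purely for illustration.
record_session(["panther", "panther habitat", "panther diet"])
record_session(["panther", "panther habitat"])
record_session(["panther", "black panther film"])

prediction = likely_next("panther")
```

A cache driven this way would fetch the results for the predicted query while the user is still reading the current results page.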

    INDEXING

    These are steps, and not just types!

    1. Forward Index

    2. Inverted Index

    FORWARD INDEX

    Pages

    Page 1: This is what it is.

    Page 2: What is it?

    Page 3: It is a panther.

    Forward Index

    1. This, is, what, it, is

    2. What, is, it

    3. It, is, a, panther
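
Building the forward index for these three pages can be sketched as follows. The tokenizer is a simplifying assumption (lowercase, letters only); a real engine would do considerably more.

```python
import re

pages = {
    1: "This is what it is.",
    2: "what is it ?",
    3: "It is a panther.",
}

def tokenize(text):
    """Lowercase the text and keep only runs of letters (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())

# Forward index: page id -> tokens in the order they appear on the page.
forward_index = {page_id: tokenize(text) for page_id, text in pages.items()}
```

Note that the forward index keeps duplicates and word order, which is what later allows pages to be ranked by whether the query words appear in sequence.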

    INVERTED INDEX

    Forward Index

    1. This, is, what, it, is

    2. What, is, it

    3. It, is, a, panther

    Inverted Index

    This - 1

    Is - 1, 2, 3

    What - 1, 2

    It - 1, 2, 3

    A - 3

    Panther - 3

    A search term like "what is it?" gives pages 1 and 2 as the best results, since both contain all three words.

    But the words occur in the same order on only one page, page 2, so it is ranked on top.
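
The inversion step and the ranking of the example query can be sketched as follows. Treating "same order" as a contiguous phrase match is an assumption of this sketch, and `contains_phrase` and `search` are hypothetical helper names.

```python
forward_index = {
    1: ["this", "is", "what", "it", "is"],
    2: ["what", "is", "it"],
    3: ["it", "is", "a", "panther"],
}

# Invert: token -> sorted list of the pages that contain it (one entry per token).
inverted_index = {}
for page_id, tokens in forward_index.items():
    for token in set(tokens):                 # set() drops repeats within a page
        inverted_index.setdefault(token, []).append(page_id)
for postings in inverted_index.values():
    postings.sort()

def contains_phrase(tokens, phrase):
    """True if phrase occurs as a contiguous run inside tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def search(query_tokens):
    # Candidate pages must contain every query token.
    candidates = set(forward_index)
    for token in query_tokens:
        candidates &= set(inverted_index.get(token, []))
    # Pages holding the tokens in the same order sort first, then by page id.
    return sorted(
        candidates,
        key=lambda p: (not contains_phrase(forward_index[p], query_tokens), p),
    )

ranking = search(["what", "is", "it"])   # pages 1 and 2 match; page 2 ranks first
```

The inverted index finds the candidate pages quickly, while the forward index is consulted only for those candidates to check word order.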

    TROUBLES ENCOUNTERED

    The indexed documents correspond to an older version of the web pages.

    The documents matched for a cached query correspond to an older version of the index.

    A periodic refresh has to be done to tackle the above troubles!
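
One simple way to approximate a periodic refresh is to give every cached entry a time-to-live (TTL) and treat it as stale once that time has passed. The slides do not say how the refresh is done, so this is only a sketch under that assumption; `ExpiringCache` is a hypothetical name, and the clock is injectable so the behaviour can be shown without waiting.

```python
import time

class ExpiringCache:
    """Result cache whose entries count as stale after ttl seconds."""
    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock            # injectable clock, useful for testing
        self.entries = {}             # query -> (results, time stored)

    def put(self, query, results):
        self.entries[query] = (results, self.clock())

    def get(self, query):
        entry = self.entries.get(query)
        if entry is None:
            return None
        results, stored_at = entry
        if self.clock() - stored_at > self.ttl:
            del self.entries[query]   # stale: force a fresh lookup
            return None
        return results

# Demonstration with a fake clock instead of real waiting.
now = [0.0]
cache = ExpiringCache(ttl=60, clock=lambda: now[0])
cache.put("panther", ["page3"])
fresh = cache.get("panther")      # still within the TTL
now[0] = 61.0
stale = cache.get("panther")      # TTL exceeded, entry is dropped
```

A stale entry is handled like a cache miss, so the next identical query goes back to the index and picks up the newer version.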

    IMPACT

    [Figure: impact compared for the Direct Cache and Inverted Index variants]

    THANK YOU