Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search

Miguel Costa, Mário J. Silva

Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática

XLDB Research Group

mjs@di.fc.ul.pt

Portuguese Web

● There is an identifiable community Web that we call the Portuguese Web

– The web of the people directly related to Portugal

● This is NOT a small community web

– 10M population (PT)

– 3+ M users

– 4+ M pages

Tumba! (Temos um Motor de Busca Alternativo!, "We Have an Alternative Search Engine!")

● Public service

– Community Web Search Engine

– Web Archive

– Research infrastructure

● See it in action at http://tumba.pt

Statistics

● Up to 20,000 queries/day

● 3.5 million documents under .PT – the deepest crawl!

● 95% of responses under 0.5 sec

Tumba!

[Figure: Tumba! component pipeline. Labels: Web, Crawlers, Repository, Indexing Engine, Ranking Engine, Presentation Engine, SIDRA.]

crawling+archiving

[Figure: crawling and archiving data flow. Labels: Web, ViúvaNegra (Crawling Engine), WebStore (Contents Repository), Versus (Meta-data Repository), Seed URLs, ".PT" DNS Authority, User Input.]

Query Processing Architecture (indexing phase)

[Figure: indexing phase. Labels: WebStore (Contents Repository), Versus (Meta-data Repository), IndexDataStructsGenerator, Word Index, Page Attributes (Authority).]

SIDRA - Word Index Data Structure

• 2 files

– Term {docID}

– <Term, docID> {hit}

• Hit = position + attrib

• DocID assigned in Static Rank order

[Figure: example of the two index files]

Terms → document ids

blue → 2, 5

dog → 1, 3, 4

Terms + sids → hits

blue + 2 → hit, hit ...

blue + 5 → hit, hit ...

dog + 1 → hit, hit ...

dog + 3 → hit, hit ...

dog + 4 → hit, hit ...
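The two-file layout above can be sketched in a few lines of Python. This is only an illustration, not SIDRA's on-disk format: the `Hit` fields, the attribute tags ("title", "bold", "plain") and the assumption that a smaller docID means a higher static rank are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    position: int  # word offset inside the document
    attrib: str    # hypothetical attribute tag, e.g. "title", "bold", "plain"


def build_word_index(docs):
    """docs maps docID -> list of (term, attrib) tokens in document order."""
    postings = defaultdict(list)  # file 1: term -> [docID, ...]
    hits = defaultdict(list)      # file 2: (term, docID) -> [Hit, ...]
    # Assumption: docIDs were assigned in static rank order, so visiting them
    # in ascending order leaves every posting list sorted by static rank.
    for doc_id in sorted(docs):
        for position, (term, attrib) in enumerate(docs[doc_id]):
            if not postings[term] or postings[term][-1] != doc_id:
                postings[term].append(doc_id)
            hits[(term, doc_id)].append(Hit(position, attrib))
    return dict(postings), dict(hits)


docs = {
    1: [("dog", "title")],
    2: [("blue", "plain")],
    3: [("dog", "bold")],
    4: [("dog", "plain")],
    5: [("blue", "title"), ("blue", "plain")],
}
postings, hits = build_word_index(docs)
```

Here `postings` reproduces the docID lists of the figure (blue: 2, 5; dog: 1, 3, 4), and `hits` holds one hit list per <Term, docID> pair.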

SIDRA – Index Range Partitioning

[Figure: the docIds index and the hits index, each split by term range across two hosts]

Host 1, docIds index: blue → 2, 5; dog → 1, 3, 4

Host 1, hits index: blue + 2, blue + 5, dog + 1, dog + 3, dog + 4 → hit, hit ...

Host 2, docIds index: sea → 2, 10; xldb → 7, 9, 25

Host 2, hits index: sea + 2, sea + 10, xldb + 7, xldb + 9, xldb + 25 → hit, hit ...
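A minimal sketch of range partitioning by term, assuming a single hypothetical split point "m" and two hypothetical host names; the slides do not say how split points are chosen.

```python
import bisect

# Hypothetical configuration: terms below "m" go to the first host,
# terms from "m" onwards go to the second.
SPLIT_POINTS = ["m"]
HOSTS = ["host-a", "host-b"]


def host_for(term):
    """Return the host that owns the range containing `term`."""
    return HOSTS[bisect.bisect_right(SPLIT_POINTS, term)]


def partition(postings):
    """Split a term -> docIDs mapping into one shard per host."""
    shards = {host: {} for host in HOSTS}
    for term, doc_ids in postings.items():
        shards[host_for(term)][term] = doc_ids
    return shards


shards = partition({
    "blue": [2, 5],
    "dog": [1, 3, 4],
    "sea": [2, 10],
    "xldb": [7, 9, 25],
})
```

With these assumptions, "blue" and "dog" land on one host and "sea" and "xldb" on the other, matching the split in the figure. A query for one term only needs to contact the host that owns its range.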

SIDRA - Ranking Engine

[Figure: ranking engine. Labels: Clients, Query Broker, QueryServer, Word Index (several), Page Attributes.]

Matching & Ranking Algorithm

Phase 1: Query Matching

• QueryServers fetch matching docIDs (pre-sorted in static ranking order)

• QueryBrokers merge results using a distributed merge-sort algorithm (preserves ranking order)

Phase 2: Ranking

• Pick the first N (1000) results from phase 1

• Compute final rank using hits data

– Are terms also in title?

– What is the distance among query terms in the page?

– Terms in Bold, Italic?
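The two phases can be sketched as follows. The scoring weights, the attribute tags and the proximity formula are invented for illustration; the slides name the signals (title, distance between query terms, bold/italic) but not how they are combined. The sketch also assumes that a smaller docID means a higher static rank.

```python
import heapq
from itertools import islice


def merge_ranked(*streams):
    """Phase 1 (broker): k-way merge of docID streams already sorted by static rank."""
    previous = None
    for doc_id in heapq.merge(*streams):
        if doc_id != previous:  # drop duplicates reported by several servers
            yield doc_id
        previous = doc_id


def final_score(doc_id, query_terms, hits):
    """Phase 2: toy score from hit data; hits maps (term, docID) -> [(position, attrib)]."""
    score = 0.0
    positions = []
    for term in query_terms:
        for position, attrib in hits.get((term, doc_id), []):
            positions.append(position)
            if attrib == "title":
                score += 2.0  # arbitrary weight
            elif attrib in ("bold", "italic"):
                score += 1.0  # arbitrary weight
    if len(positions) > 1:
        # Crude proximity bonus: the tighter the span of all hits, the higher.
        score += 1.0 / (1 + max(positions) - min(positions))
    return score


def rank(streams, query_terms, hits, n=1000):
    candidates = list(islice(merge_ranked(*streams), n))
    # sorted() is stable, so ties keep their static rank order.
    return sorted(candidates,
                  key=lambda d: final_score(d, query_terms, hits),
                  reverse=True)


hits = {
    ("blue", 4): [(0, "title")],
    ("dog", 4): [(1, "plain")],
    ("blue", 2): [(5, "plain")],
    ("dog", 2): [(40, "plain")],
}
result = rank([[1, 4, 9], [2, 4, 7]], ["blue", "dog"], hits, n=3)
```

Only the first `n` merged candidates are re-scored, so the expensive hit inspection is bounded regardless of how many documents match.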

Architecture

Index design

● horizontal/global partition ~ each QueryServer holds all documents for one criterion, e.g. for one keyword

● allow searches on different criteria in parallel (partition parallelism)

● Brokers merge results received in parallel as they are being produced (pipeline parallelism)
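One way to picture both kinds of parallelism, assuming each QueryServer is a thread that pushes its sorted matches into a bounded queue; the `serve`, `drain` and `broker` names are hypothetical, and a real deployment would use network connections rather than in-process queues.

```python
import heapq
import queue
import threading

_DONE = object()  # sentinel marking the end of one server's results


def serve(doc_ids, out):
    """Hypothetical QueryServer: emits its sorted matches one at a time."""
    for doc_id in doc_ids:
        out.put(doc_id)
    out.put(_DONE)


def drain(q):
    """Iterate over a queue, blocking until the next item is available."""
    while (item := q.get()) is not _DONE:
        yield item


def broker(partitions):
    """Merge results from all servers while they are still producing."""
    queues = [queue.Queue(maxsize=4) for _ in partitions]
    for doc_ids, q in zip(partitions, queues):
        # Partition parallelism: one worker per partition.
        threading.Thread(target=serve, args=(doc_ids, q), daemon=True).start()
    # Pipeline parallelism: heapq.merge pulls lazily, so merged results
    # flow out before any server has finished.
    yield from heapq.merge(*(drain(q) for q in queues))


merged = list(broker([[1, 5, 9], [2, 3, 10], [4]]))
```

The bounded queues keep a fast server from running far ahead of the broker, and the consumer can stop early without waiting for the remaining results.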

Addressing Multi-dimensionality

● Generalization: page rank (a page importance measure) is but one of the possible ranking contexts.

● Query Servers may index data according to other dimensions

– Time

– Location

– ...

● Query Brokers perform the results “fusion”

Flexibility / Scalability

• User requests may be balanced among multiple Presentation Engines

• Contents may be replicated

• Requests may be balanced among multiple Query Brokers

• Page Attributes may be replicated

• Query Brokers may balance requests to multiple Query Servers

• Multiple Query servers for a Word Index

• Word indexes may be replicated

[Figure: replicated deployment. Labels: Presentation Engine, Query Broker, QueryServer, Word Index (several), Page Attributes, WebStore (Contents Repository).]

Non-functional properties

● load-balancing ~ components distribute requests to multiple replicas (round-robin or least loaded)

● fault-tolerance ~ components can detect high response times and redirect requests.
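The two properties above can be sketched as a small in-process balancer. The `Replica` and `Balancer` classes, the 0.5 second threshold and the rule of never re-admitting a slow replica are assumptions made for the example, not a description of the Tumba! components.

```python
import itertools
import time


class Replica:
    def __init__(self, name, handler):
        self.name = name
        self.handler = handler  # callable that answers a request
        self.in_flight = 0      # requests currently being served


class Balancer:
    def __init__(self, replicas, slow_after=0.5):
        self.replicas = replicas
        self.slow_after = slow_after  # seconds; hypothetical threshold
        self.slow = set()             # names of replicas seen responding slowly
        self._cycle = itertools.cycle(replicas)

    def pick_round_robin(self):
        for _ in range(len(self.replicas)):
            replica = next(self._cycle)
            if replica.name not in self.slow:
                return replica
        return next(self._cycle)  # every replica is slow: use one anyway

    def pick_least_loaded(self):
        healthy = [r for r in self.replicas if r.name not in self.slow]
        return min(healthy or self.replicas, key=lambda r: r.in_flight)

    def send(self, request, pick=None):
        replica = (pick or self.pick_round_robin)()
        replica.in_flight += 1
        start = time.monotonic()
        try:
            return replica.handler(request)
        finally:
            replica.in_flight -= 1
            # High response time detected: redirect later requests elsewhere.
            if time.monotonic() - start > self.slow_after:
                self.slow.add(replica.name)


fast = Replica("fast", lambda r: "fast:" + r)
slow = Replica("slow", lambda r: (time.sleep(0.1), "slow:" + r)[1])
balancer = Balancer([slow, fast], slow_after=0.05)
answers = [balancer.send(q) for q in ("q1", "q2", "q3")]
```

After the first slow answer, the balancer skips the slow replica. A production version would also re-probe slow replicas periodically so that they can return to service.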

Results

● With 1 QueryServer and 1 Broker, the system responds to workloads of 50 requests per second with an average response time of 0.779 seconds

● With 2 QueryServers and 1 Broker, the system responds to workloads of 110 requests per second with an average response time of 0.871 seconds

● Extensive discussion in upcoming dissertation

Tumba!

● Modest effort:

– 1 Prof., 4-5 graduate students, 4-5 servers for 2 years

● Still beta!

– Fault-tolerance will require substantially more hardware (replication)

– Periodic updates will demand more storage

– Full-time operators?

● Encouraging feedback

http://tumba.pt
