Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search Miguel Costa, Mário J. Silva Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática XLDB Research Group [email protected]


Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search

Miguel Costa, Mário J. Silva

Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática

XLDB Research Group

[email protected]


Portuguese Web

● There is an identifiable community Web that we call the Portuguese Web
– The web of the people directly related to Portugal

● This is NOT a small community web
– 10M population PT
– 3+ M users
– 4+ M pages


Tumba! (Temos um Motor de Busca Alternativo!, "We Have an Alternative Search Engine!")

● Public service
– Community Web Search Engine
– Web Archive
– Research infrastructure

● See it in action at http://tumba.pt


Statistics

● Up to 20,000 queries/day

● 3.5 million documents under .PT – the deepest crawl!

● 95% of responses under 0.5 sec


Tumba!

[Figure: Tumba! architecture. Components shown: Web, Crawlers, Repository, Indexing Engine, Ranking Engine, Presentation Engine. Label: SIDRA.]


crawling+archiving

[Figure: crawling and archiving. Components shown: Web, ViúvaNegra (Crawling Engine), WebStore (Contents Repository), Versus (Meta-data Repository). Other labels: Seed URLs, ".PT" DNS Authority, User Input.]


Query Processing Architecture (indexing phase)

[Figure: indexing phase. Components shown: WebStore (Contents Repository), Versus (Meta-data Repository), IndexDataStructs Generator, Word Index, Page Attributes (Authority).]


SIDRA - Word Index Data Structure

• 2 files
– Term -> {docID}
– <Term, docID> -> {hit}

• Hit = position + attrib

• DocID assigned in Static Rank order

[Figure: example contents of the two files]

Terms -> document ids
  blue -> 2, 5
  dog  -> 1, 3, 4
  ...

Terms + sids -> hits
  blue + 2 -> hit, hit, ...
  blue + 5 -> hit, hit, ...
  dog + 1  -> hit, hit, ...
  dog + 3  -> hit, hit, ...
  dog + 4  -> hit, hit, ...
  ...
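The two-file layout on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_word_index`, the in-memory dictionaries standing in for the two files, and the attribute strings are all hypothetical, and docIDs are assumed to have been assigned in static rank order beforehand.

```python
from collections import defaultdict

def build_word_index(docs):
    """Build the two structures from the slide (in memory, not on disk).

    docs: mapping docID -> list of (term, attrib) tokens in page order.
    Returns (doc_ids, hits):
      doc_ids: term -> list of docIDs, ascending        (file 1)
      hits:    (term, docID) -> list of (position, attrib)  (file 2)
    """
    doc_ids = defaultdict(list)
    hits = defaultdict(list)
    # Assumption: a smaller docID means a better static rank, so visiting
    # docIDs in ascending order keeps every posting list rank-sorted.
    for doc_id in sorted(docs):
        for position, (term, attrib) in enumerate(docs[doc_id]):
            if not hits[(term, doc_id)]:
                doc_ids[term].append(doc_id)  # first hit of term in this doc
            hits[(term, doc_id)].append((position, attrib))
    return dict(doc_ids), dict(hits)

# Hypothetical toy collection.
docs = {
    2: [("blue", "title"), ("sea", "plain")],
    1: [("dog", "plain")],
    5: [("blue", "bold"), ("blue", "plain")],
}
doc_ids, hits = build_word_index(docs)
print(doc_ids["blue"])      # [2, 5]
print(hits[("blue", 5)])    # [(0, 'bold'), (1, 'plain')]
```

Keeping the docID lists separate from the hit lists means a matching step only has to read the first, smaller structure.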


SIDRA – Index Range Partitioning

[Figure: the docIds index (Terms -> document ids) and the hits index (Terms + sids -> hits), range-partitioned by term across two hosts]

Host (first term range)
  Terms -> document ids
    blue -> 2, 5
    dog  -> 1, 3, 4
    ...
  Terms + sids -> hits
    blue + 2, blue + 5, dog + 1, dog + 3, dog + 4, ... (each followed by its hits)

Host (second term range)
  Terms -> document ids
    sea  -> 2, 10
    xldb -> 7, 9, 25
    ...
  Terms + sids -> hits
    sea + 2, sea + 10, xldb + 7, xldb + 9, xldb + 25, ... (each followed by its hits)
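Range partitioning by term can be illustrated with a small lookup routine. This is a sketch, not SIDRA's actual code: the class name `RangePartitioner`, the host names, and the split point `"m"` are assumptions chosen so that the example terms from the slide fall on two hosts.

```python
from bisect import bisect_right

class RangePartitioner:
    """Maps a term to the host that owns its alphabetical range."""

    def __init__(self, split_terms, hosts):
        # split_terms[i] is the first term owned by hosts[i + 1].
        if len(hosts) != len(split_terms) + 1:
            raise ValueError("need exactly one more host than split points")
        self.split_terms = sorted(split_terms)
        self.hosts = list(hosts)

    def host_for(self, term):
        # Binary search over the split points picks the owning range.
        return self.hosts[bisect_right(self.split_terms, term)]

partitioner = RangePartitioner(["m"], ["host-a", "host-b"])
for term in ["blue", "dog", "sea", "xldb"]:
    print(term, "->", partitioner.host_for(term))
# blue -> host-a
# dog -> host-a
# sea -> host-b
# xldb -> host-b
```

With this scheme a single-term query touches exactly one host, while a multi-term query may touch one host per term.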


SIDRA - Ranking Engine

[Figure: ranking engine. Components shown: Clients, Query Broker, QueryServer, Word Index (several), Page Attributes.]


Matching & Ranking Algorithm

Phase 1: Query Matching

• QueryServers fetch matching docIDs (pre-sorted in static ranking order)

• QueryBrokers merge results using distributed merge-sort algorithm (preserves ranking order)

Phase 2: Ranking

• Pick N (1000) first results from phase 1

• Compute final rank using hits data
– Are terms also in title?
– What is the distance among query terms in the page?
– Terms in Bold, Italic?
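The two phases can be sketched as follows. Everything beyond the slide's outline is an assumption made for illustration: the function names, the attribute strings (`"title"`, `"bold"`, `"italic"`), the weights 2.0 and 0.5, the proximity formula, and the choice to deduplicate the merged stream. The slide does not say how the final rank combines these signals.

```python
import heapq

def final_score(doc_id, terms, hits):
    """Phase 2 score from hits data; weights are hypothetical."""
    score = 0.0
    first_positions = []
    for term in terms:
        term_hits = hits.get((term, doc_id), [])
        if any(attrib == "title" for _, attrib in term_hits):
            score += 2.0   # term also appears in the title
        if any(attrib in ("bold", "italic") for _, attrib in term_hits):
            score += 0.5   # term is emphasized
        if term_hits:
            first_positions.append(min(pos for pos, _ in term_hits))
    if len(first_positions) > 1:
        span = max(first_positions) - min(first_positions)
        score += 1.0 / (1 + span)  # closer query terms score higher
    return score

def rank(server_results, terms, hits, n=1000):
    # Phase 1: k-way merge of streams that are already in static rank
    # order (ascending docID); the merged stream keeps that order.
    candidates = []
    for doc_id in heapq.merge(*server_results):
        if not candidates or candidates[-1] != doc_id:
            candidates.append(doc_id)
        if len(candidates) == n:
            break
    # Phase 2: re-rank only the first n candidates. sorted() is stable,
    # so ties keep their static rank order.
    return sorted(candidates, key=lambda d: -final_score(d, terms, hits))

hits = {
    ("blue", 4): [(0, "title")],
    ("dog", 4): [(1, "plain")],
    ("blue", 1): [(10, "plain")],
    ("dog", 2): [(3, "bold")],
}
print(rank([[1, 4, 9], [2, 4, 7]], ["blue", "dog"], hits))  # [4, 2, 1, 7, 9]
```

Because phase 2 looks only at the first N candidates, its cost is bounded regardless of how many documents match in phase 1.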


Architecture


Index design

● horizontal/global partition ~ each QueryServer contains all documents matching one criterion, e.g., one keyword

● allow searches on different criteria in parallel (partition parallelism)

● Brokers merge results received in parallel as they are being produced (pipeline parallelism)
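A minimal sketch of how partition and pipeline parallelism might combine, assuming each QueryServer can be modeled as a function that yields docIDs in ascending order. The names `stream_from`, `broker_query`, and the `servers` mapping are hypothetical. Threads and queues stand in for network connections.

```python
import heapq
import queue
import threading

_DONE = object()  # sentinel marking the end of one server's stream

def stream_from(server_fn, term):
    """Run server_fn(term) in its own thread; expose results as an iterator."""
    results = queue.Queue()

    def worker():
        try:
            for doc_id in server_fn(term):
                results.put(doc_id)
        finally:
            results.put(_DONE)  # always terminate the stream

    threading.Thread(target=worker, daemon=True).start()
    return iter(results.get, _DONE)

def broker_query(servers, terms):
    # Partition parallelism: one server thread per term, all running at once.
    streams = [stream_from(servers[term], term) for term in terms]
    # Pipeline parallelism: heapq.merge is lazy, so the broker emits merged
    # docIDs while the servers are still producing the rest.
    return heapq.merge(*streams)

# Hypothetical servers, each owning the posting list of one keyword.
servers = {
    "blue": lambda term: iter([2, 5]),
    "dog": lambda term: iter([1, 3, 4]),
}
print(list(broker_query(servers, ["blue", "dog"])))  # [1, 2, 3, 4, 5]
```

The merge only needs the current head of each stream, which is why the broker can start emitting before any server has finished.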


Addressing Multi-dimensionality

● Generalization: page-rank (a page importance measure) is but one of the possible ranking contexts.

● Query Servers may index data according to other dimensions
– Time
– Location
– ...

● Query Brokers perform the results “fusion”
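The slides do not say how the fusion is computed. As one possible illustration, the sketch below uses reciprocal rank fusion, a standard technique swapped in here and not attributed to the authors. The dimension names and the constant `k=60` are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse per-dimension rankings into one list.

    rankings: mapping dimension name -> list of docIDs, best first.
    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings.values():
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + position)
    # Highest fused score first; docID breaks ties deterministically.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from Query Servers indexed by two dimensions.
rankings = {
    "authority": [1, 2, 3],
    "time": [3, 1, 4],
}
print(reciprocal_rank_fusion(rankings))  # [1, 3, 2, 4]
```

A rank-based fusion like this avoids having to make scores from different dimensions directly comparable.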


Flexibility / Scalability

• User requests may be balanced among multiple Presentation Engines

• Contents may be replicated

• Requests may be balanced among multiple Query Brokers

• Page Attributes may be replicated

• Query Brokers may balance requests to multiple Query Servers

• Multiple Query Servers for a Word Index

• Word indexes may be replicated

[Figure: components shown: Presentation Engine, WebStore (Contents Repository), Query Broker, Page Attributes, QueryServer, Word Index (several).]


Non-functional properties

● load-balancing ~ components distribute requests to multiple replicas (round-robin or least loaded)

● fault-tolerance ~ components can detect high response times and redirect requests.
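One way these two properties could fit together is a round-robin balancer that skips replicas whose last response was slow. This is a sketch under assumptions: the class name `Balancer`, the 0.5-second threshold, and modeling replicas as plain callables are all hypothetical. It reroutes later requests rather than redirecting one already in flight.

```python
import time

class Balancer:
    """Round-robin over replicas, skipping those that last answered slowly."""

    def __init__(self, replicas, slow_after=0.5):
        self.replicas = list(replicas)   # callables: request -> response
        self.slow_after = slow_after     # seconds; hypothetical threshold
        self.last_latency = [0.0] * len(self.replicas)
        self._turn = 0

    def _pick(self):
        n = len(self.replicas)
        for offset in range(n):
            i = (self._turn + offset) % n
            if self.last_latency[i] <= self.slow_after:
                self._turn = (i + 1) % n
                return i
        # Every replica looks slow: fall back to the least slow one.
        i = min(range(n), key=self.last_latency.__getitem__)
        self._turn = (i + 1) % n
        return i

    def send(self, request, clock=time.perf_counter):
        i = self._pick()
        start = clock()
        response = self.replicas[i](request)
        self.last_latency[i] = clock() - start  # remembered for next pick
        return i, response

# Deterministic demo with a fake clock: replica 1 takes 1.0 s once.
ticks = iter([0.0, 0.1, 1.0, 2.0, 3.0, 3.1, 4.0, 4.1])
balancer = Balancer([lambda r: "a", lambda r: "b"])
picks = [balancer.send("q", clock=lambda: next(ticks))[0] for _ in range(4)]
print(picks)  # [0, 1, 0, 0]
```

A production version would also need to probe a slow replica again after some time so that it can rejoin the rotation. The sketch leaves that out.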


Results

● With 1 QueryServer and 1 Broker, the system responds to workloads of 50 requests per second with an average response time of 0.779 seconds

● With 2 QueryServers and 1 Broker, the system responds to workloads of 110 requests per second with an average response time of 0.871 seconds

● Extensive discussion in upcoming dissertation
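The two measurements can be compared with a little arithmetic. The snippet uses only the figures reported on this slide. The variable names are illustrative.

```python
# Figures reported on the slide.
one_server = {"rps": 50, "avg_latency_s": 0.779}
two_servers = {"rps": 110, "avg_latency_s": 0.871}

# Ratio of sustained workloads and relative change in average response time.
throughput_gain = two_servers["rps"] / one_server["rps"]
latency_growth = two_servers["avg_latency_s"] / one_server["avg_latency_s"] - 1

print(f"throughput: x{throughput_gain:.1f}")       # throughput: x2.2
print(f"average latency: +{latency_growth:.1%}")   # average latency: +11.8%
```

In other words, adding the second QueryServer raised the sustained workload by a factor of 2.2 while the average response time grew by about 12%.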


Tumba!

● Modest effort:
– 1 Prof., 4-5 graduate students, 4-5 servers for 2 years

● Still beta!
– Fault-tolerance will require substantially more hardware (replication)
– Periodic updates will demand more storage
– Full-time operators?

● Encouraging feedback

http://tumba.pt