Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search
Miguel Costa, Mário J. Silva
Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática
XLDB Research Group
mjs@di.fc.ul.pt
Portuguese Web
● There is an identifiable community Web that we call the Portuguese Web – the Web of the people directly related to Portugal
● This is NOT a small community Web – 10M population in Portugal – 3+ M users – 4+ M pages
Tumba! (Temos um Motor de Busca Alternativo! – "We Have an Alternative Search Engine!")
● Public service– Community Web Search Engine
– Web Archive
– Research infrastructure
● See it in action at http://tumba.pt
Statistics
● Up to 20,000 queries/day
● 3.5 million documents under .PT – the deepest crawl!
● 95% of responses under 0.5 sec
Tumba!
[Architecture diagram: Web → Crawlers → Repository → Indexing Engine → Ranking Engine → Presentation Engine. SIDRA comprises the indexing and ranking components.]
crawling+archiving
[Diagram: ViúvaNegra (Crawling Engine) fetches the Web starting from Seed URLs, the ".PT" DNS Authority and User Input, storing contents in WebStore (Contents Repository) and metadata in Versus (Meta-data Repository).]
Query Processing Architecture (indexing phase)
[Diagram: the IndexDataStructsGenerator reads Versus (Meta-data Repository) and WebStore (Contents Repository) and produces the Word Index and the PageAttributes (Authority) structures.]
SIDRA - Word Index Data Structure
• 2 files:
  – Term → {docID}
  – <Term, docID> → {hit}
• Hit = position + attributes
• DocIDs assigned in static-rank order

[Figure: the docIDs file maps each term (e.g. blue, dog) to its sorted list of document ids; the hits file maps each <term, docID> pair to its list of hits.]
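The two-file structure above can be sketched in a few lines. This is a minimal in-memory illustration, not SIDRA's on-disk format; the function name, dict representation, and whitespace tokenizer are assumptions.

```python
from collections import defaultdict

def build_word_index(pages):
    """pages: list of (docID, text); docIDs already assigned in static-rank order.

    Returns the two index structures:
      doc_ids: term -> list of docIDs (kept in static-rank order)
      hits:    (term, docID) -> list of term positions (a hit's position part)
    """
    doc_ids = defaultdict(list)
    hits = defaultdict(list)
    for doc_id, text in sorted(pages):          # iterate in docID (static-rank) order
        for position, term in enumerate(text.lower().split()):
            if doc_id not in doc_ids[term]:     # docIDs stay sorted: seen in order
                doc_ids[term].append(doc_id)
            hits[(term, doc_id)].append(position)
    return doc_ids, hits

# Toy corpus echoing the figure: "blue" appears in documents 2 and 5.
idx, hit_lists = build_word_index([(2, "blue sky"), (5, "deep blue sea"), (1, "dog")])
```

Because docIDs are assigned in static-rank order, each term's posting list comes out pre-sorted by rank for free, which is what phase 1 of the query algorithm relies on.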
SIDRA – Index Range Partitioning
[Figure: the term space (blue, dog, ..., sea, xldb) is range-partitioned across hosts; each host stores the docIDs index and the hits index for its own term range.]
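Range partitioning of the term space can be sketched as a routing function. The split points and host names below are illustrative assumptions, not SIDRA's actual configuration.

```python
import bisect

# Terms lexicographically below each split point go to the earlier host.
SPLIT_POINTS = ["m"]                       # terms < "m" -> host 0, >= "m" -> host 1
HOSTS = ["index-host-0", "index-host-1"]

def host_for_term(term):
    """Each host holds both the docIDs index and the hits index for its range."""
    return HOSTS[bisect.bisect_right(SPLIT_POINTS, term)]
```

A multi-term query whose terms fall in different ranges is served by several hosts at once, which is the basis of the partition parallelism discussed later.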
SIDRA - Ranking Engine
[Diagram: Clients → Query Broker → QueryServers, each serving a Word Index partition; PageAttributes is consulted during ranking.]
Matching & Ranking Algorithm
Phase 1: Query Matching
• QueryServers fetch matching docIDs (pre-sorted in static-ranking order)
• QueryBrokers merge results using a distributed merge-sort algorithm (preserves ranking order)
Phase 2: Ranking
• Pick the first N (e.g. 1000) results from phase 1
• Compute the final rank using the hits data:
  – Are terms also in the title?
  – What is the distance among query terms in the page?
  – Are terms in bold or italic?
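The two phases above can be sketched as follows. Because each QueryServer returns docIDs already sorted in static-rank order, a heap merge at the broker preserves that order; the scoring weights in phase 2 are invented for illustration and are not SIDRA's actual ranking formula.

```python
import heapq

def phase1_merge(server_results, n=1000):
    """Merge per-server docID streams; keep the first n, still in static-rank order."""
    merged = []
    for doc_id in heapq.merge(*server_results):
        if not merged or merged[-1] != doc_id:   # drop duplicates across servers
            merged.append(doc_id)
        if len(merged) == n:
            break
    return merged

def phase2_score(doc_id, hit_attrs):
    """Re-rank one candidate using hit attributes (title, proximity, emphasis).

    hit_attrs is an assumed dict summarizing the hits for this document.
    """
    score = 0.0
    score += 10.0 if hit_attrs.get("in_title") else 0.0
    score += 5.0 / (1 + hit_attrs.get("term_distance", 0))  # closer terms score more
    score += 2.0 if hit_attrs.get("bold_or_italic") else 0.0
    return score
```

`heapq.merge` is lazy, so the broker starts emitting results before the servers finish producing theirs – the pipeline parallelism noted on the next slide.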
Architecture
Index design
● horizontal/global partitioning – each QueryServer holds all documents matching a given criterion, e.g. a keyword range
● allows searches on different criteria to run in parallel (partition parallelism)
● Brokers merge results received in parallel as they are being produced (pipeline parallelism)
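Partition parallelism at the broker can be sketched as a concurrent fan-out to all partitions. The `fetch` callable stands in for the real RPC to a QueryServer, and the host names and toy index are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def broker_query(term, hosts, fetch):
    """Query every partition concurrently and merge the sorted answers."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        per_host = list(pool.map(lambda h: fetch(h, term), hosts))
    return sorted(set().union(*per_host))    # merged, back in static-rank order

# Toy stand-in: each host answers only for the documents it indexes.
fake_index = {"host-a": {"dog": [1, 3]}, "host-b": {"dog": [4]}}

def fetch(host, term):
    return fake_index[host].get(term, [])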
Addressing Multi-dimensionality
● Generalization: page-rank (a page-importance measure) is but one of many possible ranking contexts.
● Query Servers may index data according to other dimensions – time, location, ...
● Query Brokers perform the results “fusion”
Flexibility / Scalability
• User requests may be balanced among multiple Presentation Engines
• Contents may be replicated
• Requests may be balanced among multiple Query Brokers
• Page Attributes may be replicated
• Query Brokers may balance requests to multiple Query Servers
• Multiple Query servers for a Word Index
• Word indexes may be replicated
[Diagram: Presentation Engines → Query Brokers → QueryServers with replicated Word Indexes; PageAttributes and WebStore (Contents Repository) may also be replicated.]
Non-functional properties
● load balancing – components distribute requests among multiple replicas (round-robin or least-loaded)
● fault tolerance – components detect high response times and redirect requests to other replicas
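The two properties combine naturally in one dispatch loop: rotate over replicas, and skip any that respond too slowly. This is a minimal sketch; the replica names, the `send` callable, and the use of `TimeoutError` to signal a slow replica are all illustrative assumptions.

```python
import itertools

class ReplicaBalancer:
    """Round-robin load balancing with failover across identical replicas."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)
        self._n = len(replicas)

    def call(self, request, send):
        """Try replicas in round-robin order; redirect on a slow/failed replica."""
        for _ in range(self._n):
            replica = next(self._cycle)
            try:
                return send(replica, request)
            except TimeoutError:        # high response time detected -> next replica
                continue
        raise RuntimeError("all replicas failed")

# Toy send: replica "r1" is overloaded, "r2" answers.
def send(replica, request):
    if replica == "r1":
        raise TimeoutError("slow replica")
    return f"{replica}:{request}"

balancer = ReplicaBalancer(["r1", "r2"])
```

Successive calls also advance the cycle, so healthy replicas share the load instead of one replica absorbing all traffic.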
Results
● With 1 QueryServer and 1 Broker, the system sustains workloads of 50 requests per second with an average response time of 0.779 seconds
● With 2 QueryServers and 1 Broker, it sustains workloads of 110 requests per second with an average response time of 0.871 seconds
● Extensive discussion in upcoming dissertation
Tumba!
● Modest effort:
  – 1 professor, 4-5 graduate students, 4-5 servers, for 2 years
● Still beta!
  – Fault tolerance will require substantially more hardware (replication)
  – Periodic updates will demand more storage
  – Full-time operators?
● Encouraging feedback
http://tumba.pt