Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search
Miguel Costa, Mário J. Silva
Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática
XLDB Research Group
mjs@di.fc.ul.pt
Portuguese Web
● There is an identifiable community Web that we call the Portuguese Web – the Web of the people directly related to Portugal
● This is NOT a small community Web – 10M population in Portugal – 3+ M users – 4+ M pages
Tumba! (Temos um Motor de Busca Alternativo! – "We Have an Alternative Search Engine!")
● Public service– Community Web Search Engine
– Web Archive
– Research infrastructure
● See it in action at http://tumba.pt
Statistics
● Up to 20,000 queries/day
● 3.5 million documents under .PT – the deepest crawl!
● 95% of responses under 0.5 sec
Tumba!
[Architecture diagram: Web → Crawlers → Repository → Indexing Engine → Ranking Engine → Presentation Engine. SIDRA comprises the indexing and ranking components.]
crawling+archiving
[Diagram: ViúvaNegra (Crawling Engine) fetches the Web starting from Seed URLs, the ".PT" DNS Authority and User Input, storing contents in WebStore (Contents Repository) and metadata in Versus (Meta-data Repository).]
Query Processing Architecture (indexing phase)
[Diagram: the IndexDataStructsGenerator reads Versus (Meta-data Repository) and WebStore (Contents Repository) and produces the Word Index and the PageAttributes (Authority) structures.]
SIDRA - Word Index Data Structure
• 2 files:
  – Term → {docID}
  – <Term, docID> → {hit}
• Hit = position + attributes
• DocIDs assigned in static-rank order

[Figure: the docIDs file maps each term (e.g. blue, dog) to its sorted list of document ids; the hits file maps each <term, docID> pair to its list of hits.]
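The two-file structure above can be sketched in a few lines. This is a minimal in-memory illustration, not SIDRA's on-disk format; the function name, dict representation, and whitespace tokenizer are assumptions.

```python
from collections import defaultdict

def build_word_index(pages):
    """pages: list of (docID, text); docIDs already assigned in static-rank order.

    Returns the two index structures:
      doc_ids: term -> list of docIDs (kept in static-rank order)
      hits:    (term, docID) -> list of term positions (a hit's position part)
    """
    doc_ids = defaultdict(list)
    hits = defaultdict(list)
    for doc_id, text in sorted(pages):          # iterate in docID (static-rank) order
        for position, term in enumerate(text.lower().split()):
            if doc_id not in doc_ids[term]:     # docIDs stay sorted: seen in order
                doc_ids[term].append(doc_id)
            hits[(term, doc_id)].append(position)
    return doc_ids, hits

# Toy corpus echoing the figure: "blue" appears in documents 2 and 5.
idx, hit_lists = build_word_index([(2, "blue sky"), (5, "deep blue sea"), (1, "dog")])
```

Because docIDs are assigned in static-rank order, each term's posting list comes out pre-sorted by rank for free, which is what phase 1 of the query algorithm relies on.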
SIDRA – Index Range Partitioning
[Figure: the term space (blue, dog, ..., sea, xldb) is range-partitioned across hosts; each host stores the docIDs index and the hits index for its own term range.]
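Range partitioning of the term space can be sketched as a routing function. The split points and host names below are illustrative assumptions, not SIDRA's actual configuration.

```python
import bisect

# Terms lexicographically below each split point go to the earlier host.
SPLIT_POINTS = ["m"]                       # terms < "m" -> host 0, >= "m" -> host 1
HOSTS = ["index-host-0", "index-host-1"]

def host_for_term(term):
    """Each host holds both the docIDs index and the hits index for its range."""
    return HOSTS[bisect.bisect_right(SPLIT_POINTS, term)]
```

A multi-term query whose terms fall in different ranges is served by several hosts at once, which is the basis of the partition parallelism discussed later.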
SIDRA - Ranking Engine
[Diagram: Clients → Query Broker → QueryServers, each serving a Word Index partition; PageAttributes is consulted during ranking.]
Matching & Ranking Algorithm
Phase 1: Query Matching
• QueryServers fetch matching docIDs (pre-sorted in static-ranking order)
• QueryBrokers merge results using a distributed merge-sort algorithm (preserves ranking order)
Phase 2: Ranking
• Pick the first N (e.g. 1000) results from phase 1
• Compute the final rank using the hits data:
  – Are terms also in the title?
  – What is the distance among query terms in the page?
  – Are terms in bold or italic?
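The two phases above can be sketched as follows. Because each QueryServer returns docIDs already sorted in static-rank order, a heap merge at the broker preserves that order; the scoring weights in phase 2 are invented for illustration and are not SIDRA's actual ranking formula.

```python
import heapq

def phase1_merge(server_results, n=1000):
    """Merge per-server docID streams; keep the first n, still in static-rank order."""
    merged = []
    for doc_id in heapq.merge(*server_results):
        if not merged or merged[-1] != doc_id:   # drop duplicates across servers
            merged.append(doc_id)
        if len(merged) == n:
            break
    return merged

def phase2_score(doc_id, hit_attrs):
    """Re-rank one candidate using hit attributes (title, proximity, emphasis).

    hit_attrs is an assumed dict summarizing the hits for this document.
    """
    score = 0.0
    score += 10.0 if hit_attrs.get("in_title") else 0.0
    score += 5.0 / (1 + hit_attrs.get("term_distance", 0))  # closer terms score more
    score += 2.0 if hit_attrs.get("bold_or_italic") else 0.0
    return score
```

`heapq.merge` is lazy, so the broker starts emitting results before the servers finish producing theirs – the pipeline parallelism noted on the next slide.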
Architecture
Index design
● horizontal/global partitioning – each QueryServer holds all documents matching a given criterion, e.g. a keyword range
● allows searches on different criteria to run in parallel (partition parallelism)
● Brokers merge results received in parallel as they are being produced (pipeline parallelism)
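Partition parallelism at the broker can be sketched as a concurrent fan-out to all partitions. The `fetch` callable stands in for the real RPC to a QueryServer, and the host names and toy index are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def broker_query(term, hosts, fetch):
    """Query every partition concurrently and merge the sorted answers."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        per_host = list(pool.map(lambda h: fetch(h, term), hosts))
    return sorted(set().union(*per_host))    # merged, back in static-rank order

# Toy stand-in: each host answers only for the documents it indexes.
fake_index = {"host-a": {"dog": [1, 3]}, "host-b": {"dog": [4]}}

def fetch(host, term):
    return fake_index[host].get(term, [])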
Addressing Multi-dimensionality
● Generalization: page-rank (a page-importance measure) is but one of many possible ranking contexts.
● Query Servers may index data according to other dimensions – time, location, ...
● Query Brokers perform the results “fusion”
Flexibility / Scalability
• User requests may be balanced among multiple Presentation Engines
• Contents may be replicated
• Requests may be balanced among multiple Query Brokers
• Page Attributes may be replicated
• Query Brokers may balance requests to multiple Query Servers
• Multiple Query servers for a Word Index
• Word indexes may be replicated
[Diagram: Presentation Engines → Query Brokers → QueryServers with replicated Word Indexes; PageAttributes and WebStore (Contents Repository) may also be replicated.]
Non-functional properties
● load balancing – components distribute requests among multiple replicas (round-robin or least-loaded)
● fault tolerance – components detect high response times and redirect requests to other replicas
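The two properties combine naturally in one dispatch loop: rotate over replicas, and skip any that respond too slowly. This is a minimal sketch; the replica names, the `send` callable, and the use of `TimeoutError` to signal a slow replica are all illustrative assumptions.

```python
import itertools

class ReplicaBalancer:
    """Round-robin load balancing with failover across identical replicas."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)
        self._n = len(replicas)

    def call(self, request, send):
        """Try replicas in round-robin order; redirect on a slow/failed replica."""
        for _ in range(self._n):
            replica = next(self._cycle)
            try:
                return send(replica, request)
            except TimeoutError:        # high response time detected -> next replica
                continue
        raise RuntimeError("all replicas failed")

# Toy send: replica "r1" is overloaded, "r2" answers.
def send(replica, request):
    if replica == "r1":
        raise TimeoutError("slow replica")
    return f"{replica}:{request}"

balancer = ReplicaBalancer(["r1", "r2"])
```

Successive calls also advance the cycle, so healthy replicas share the load instead of one replica absorbing all traffic.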
Results
● With 1 QueryServer and 1 Broker, the system sustains workloads of 50 requests per second with an average response time of 0.779 seconds
● With 2 QueryServers and 1 Broker, it sustains workloads of 110 requests per second with an average response time of 0.871 seconds
● Extensive discussion in upcoming dissertation
Tumba!
● Modest effort:
  – 1 professor, 4-5 graduate students, 4-5 servers, for 2 years
● Still beta!
  – Fault tolerance will require substantially more hardware (replication)
  – Periodic updates will demand more storage
  – Full-time operators?
● Encouraging feedback
http://tumba.pt