IXE: Ideare Indexing Engine

Ideare SpA

www.ideare.com

Keep it as simple as possible but not simpler.

Albert Einstein

History

6/96 IOL-University cooperation
10/96 Arianna: first SE for Italian Web
1/98 EUROsearch project
10/98 WebNet: Categorization by Context
10/98 Automated Arianna catalogue
11/98 WWW8 paper on Categorization
1/99 Ideare spin-off
3/00 Tiscali purchases 60% of Ideare
6/01 First release of IXE
01-02 Large scale deployment

Goals

Specialized tool (indexing and search)
C++ framework with high-level primitives
– Applications built with few lines of C++
– Specialization by inheritance
High performance
Scalability
Simple to maintain

Approach

Quick and clean
Significant effort in designing best abstractions
Refined through extensive usage

Fundamental Ideas

Rely on hardware caching and mmap
– Keep data as compact as possible
– Structure on disk same as used by algorithms
Rely on good data structures and algorithms
– STL
Specialize data structures
– For indexing
– For search

Indexing

Posting lists are created in memory
– Provide as much memory as possible to indexing machines
When size of lists reaches a threshold, dump partial index to disk
Perform final merging of partial indexes
Merging operation used also for:
– Incremental indexing
– Distributed indexing
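The indexing cycle above can be sketched in a few lines. This is only an illustration under stated assumptions: `InMemoryIndexer`, `PartialIndex` and `mergeRuns` are hypothetical names, not IXE classes, and the "dump to disk" step is replaced by keeping each partial index in a vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; these are not the IXE classes.
using DocId = std::uint32_t;
using PostingList = std::vector<DocId>;                   // sorted document ids
using PartialIndex = std::map<std::string, PostingList>;  // term -> postings

class InMemoryIndexer {
public:
    explicit InMemoryIndexer(std::size_t threshold)
        : threshold_(threshold), postings_(0) {
        assert(threshold > 0);
    }

    // Record one occurrence of `term` in `doc`.
    // Documents are assumed to arrive in increasing id order.
    void add(const std::string& term, DocId doc) {
        PostingList& list = current_[term];
        if (list.empty() || list.back() != doc) {
            list.push_back(doc);
            ++postings_;
        }
        if (postings_ >= threshold_) flush();
    }

    // Stand-in for "dump partial index to disk": the run is kept in memory.
    void flush() {
        if (current_.empty()) return;
        runs_.push_back(std::move(current_));
        current_.clear();
        postings_ = 0;
    }

    const std::vector<PartialIndex>& runs() const { return runs_; }

private:
    std::size_t threshold_;
    std::size_t postings_;
    PartialIndex current_;
    std::vector<PartialIndex> runs_;
};

// Final merge: per term, the union of the sorted posting lists of all runs.
PartialIndex mergeRuns(const std::vector<PartialIndex>& runs) {
    PartialIndex merged;
    for (const PartialIndex& run : runs) {
        for (const auto& entry : run) {
            PostingList& out = merged[entry.first];
            PostingList tmp;
            std::set_union(out.begin(), out.end(),
                           entry.second.begin(), entry.second.end(),
                           std::back_inserter(tmp));
            out.swap(tmp);
        }
    }
    return merged;
}
```

Because `mergeRuns` only needs sorted runs, the same function could merge runs produced at a later time (incremental indexing) or on other machines (distributed indexing).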

Colors

Generalization of Google hits properties (anchor, size, capitalization)
Similar to Fulcrum zones
Used for ranking
– E.g. title words contribute more to rank of document
Used for selective queries
– E.g. text matches author = attardi
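One way to picture colors is as a bit mask stored with each hit. The sketch below is an assumption for illustration: the color names, the weights and the `Hit` layout are invented here and are not taken from IXE.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical color set; names and weights are illustrative only.
enum Color : std::uint8_t { BODY = 1, TITLE = 2, ANCHOR = 4, AUTHOR = 8 };

struct Hit {
    std::uint32_t doc;       // document id
    std::uint32_t position;  // word position inside the document
    std::uint8_t colors;     // bitwise OR of Color values
};

// Contribution of one hit to the rank: title words count more than body words.
int colorWeight(std::uint8_t colors) {
    int w = 0;
    if (colors & BODY) w += 1;
    if (colors & TITLE) w += 8;
    if (colors & ANCHOR) w += 4;
    if (colors & AUTHOR) w += 2;
    return w;
}

// Ranking use: sum the color weights of all hits of one document.
int score(const std::vector<Hit>& hits, std::uint32_t doc) {
    int total = 0;
    for (const Hit& h : hits)
        if (h.doc == doc) total += colorWeight(h.colors);
    return total;
}

// Selective query use: keep only the hits that carry the requested color.
std::vector<Hit> restrictTo(const std::vector<Hit>& hits, Color c) {
    std::vector<Hit> out;
    for (const Hit& h : hits)
        if (h.colors & c) out.push_back(h);
    return out;
}
```

A query such as author = attardi would then amount to `restrictTo(hitsFor("attardi"), AUTHOR)`, where `hitsFor` is a hypothetical lexicon lookup.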

Early 2001

IXE released
Ideare starts deployment
– June: Italian Web (50 million documents) served by 3 PCs with IXE
– Fall: expanded to Germany, France, Switzerland
– Fall: Video, Image, Shopping search on IXE
October: evaluation and negotiations with major German search portal

Overall

IXE runs on PCs (better than Solaris or Alpha)
Fully self-contained library
Its own multithreaded server
Distributed crawler
Distributed indexing and merge
Parallel search
Web Service architecture
.NET managed code interface

Features

Full text + phrase + proximity
Boolean queries
Colors: HTML, XML tags
Multiple collections
Incremental indexing
Scalability:
– TeraByte collections
– Distributed multithreaded servers

Features (2)

Pluggable Document Readers: Office, PDF
Compressed document cache
Document snippets with highlights
Programmable query syntax
Clustering of results (prototype)

Technology

C++ OO architecture
Fast indexing
– Sort-based inversion
Fast search
– Efficient algorithms and data structures
– Query Compiler
• Small Adaptive Set Intersection
– Suffix array with supra index
– Memory mapped index files
Programmable API library
Template metaprogramming
Full Object Data Base

Architecture

[Diagram: Gatherer, Table<DocInfo>, Indexer, Lexicon, Postings, Hit Lists, DocStore; storage labels: mmap, Berkeley DB, local cache. DocInfo records carry the fields name, time, size; extended records add title, summary, type.]

Architecture

[Diagram: Gatherers (.html, .doc, .pdf, .ps, .txt), Indexers, Index, Posting, DocStore, Multithread Query.]

Storing Objects in Relational Tables

SQL

create table video (
  name varchar(256),
  caption varchar(2048),
  format INT,
  PRIMARY KEY(name))

Template Metaprogramming

class Video : public DocInfo {
  char* name;
  char* caption;
  int format;

  META(Video, (SUPERCLASS(DocInfo),
               VARKEY(name, 256),
               VARFIELD(caption, 2048),
               FIELD(format)));
};

Programming Applications (C++)

Collection<Video> videos("CNN");
videos.insert(video1);

Query q("caption MATCHES Jordan and format=wav");

Cursor<Video> cursor(videos, q);

while (cursor.hasNext())
  cout << cursor.get();

Small Adaptive Set Intersection

Query compiler
– One cursor on posting lists for each node
– CursorWord, CursorAnd, CursorOr, CursorPhrase
QueryCursor.next(Result& min)
– Returns first result r >= min
Single operator for all kinds of queries: e.g. proximity

SASI example

Query: world wide web

Posting lists:
3 9 12 20 40 47
1 8 10 40 41
2 4 6 21 40
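The cursor operator next(min) can be sketched on the three posting lists of this example. The code is a simplified illustration, not the IXE implementation and not the full published adaptive algorithm: `WordCursor`, `AndCursor` and `intersect` are hypothetical names, and the galloping search is one possible way to skip ahead inside a list.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

using DocId = std::uint32_t;
const DocId END = std::numeric_limits<DocId>::max();  // sentinel: list exhausted

// Cursor over one sorted posting list (stand-in for a word cursor).
class WordCursor {
public:
    explicit WordCursor(std::vector<DocId> list) : list_(std::move(list)), pos_(0) {}

    // Return the first posting >= min, or END.
    // Successive calls must use non-decreasing values of min.
    DocId next(DocId min) {
        std::size_t step = 1, lo = pos_, hi = pos_;
        while (hi < list_.size() && list_[hi] < min) {  // gallop: 1, 2, 4, ...
            lo = hi + 1;
            hi += step;
            step *= 2;
        }
        hi = std::min(hi, list_.size());
        pos_ = std::lower_bound(list_.begin() + lo, list_.begin() + hi, min)
               - list_.begin();
        return pos_ < list_.size() ? list_[pos_] : END;
    }

private:
    std::vector<DocId> list_;
    std::size_t pos_;
};

// AND node: asks each child in turn for the first result >= candidate.
class AndCursor {
public:
    explicit AndCursor(std::vector<WordCursor> children)
        : children_(std::move(children)) {
        assert(!children_.empty());
    }

    DocId next(DocId min) {
        DocId candidate = min;
        std::size_t agreed = 0;  // consecutive children that returned candidate
        std::size_t i = 0;
        while (agreed < children_.size()) {
            DocId r = children_[i].next(candidate);
            if (r == END) return END;
            if (r == candidate) {
                ++agreed;
            } else {             // child skipped ahead: restart with new candidate
                candidate = r;
                agreed = 1;
            }
            i = (i + 1) % children_.size();
        }
        return candidate;
    }

private:
    std::vector<WordCursor> children_;
};

// Enumerate all documents present in every list.
std::vector<DocId> intersect(std::vector<std::vector<DocId>> lists) {
    std::vector<WordCursor> cursors;
    for (auto& l : lists) cursors.emplace_back(std::move(l));
    AndCursor all(std::move(cursors));
    std::vector<DocId> out;
    for (DocId r = all.next(0); r != END; r = all.next(r + 1)) out.push_back(r);
    return out;
}
```

On the lists above the candidate moves 3, 8, 21, 40 and all three cursors agree on 40, so most postings are never compared individually.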

Performance

Comparison (single node)

Columns: Indexing Time | Search Speed ¹ | Programmability | Excerpts | Proximity Rank | Ranking

IXE | 2 GB/h | 30 q/s | C++ API | Link popularity
Fulcrum | 0.7 GB/h | 6 q/s | C API | no | no | no
Google | ? | 1 q/s | C, python | PageRank
Fast | 1-2 GB/h | 3 q/s | C | planned | FirstPage
SharePoint | 0.2 GB/h | 3 q/s | C++ | no | ? | ?
Verity | 0.2 GB/h | 4 q/s | no | ? | ?

¹ 2 million documents

Comparison (2)

Columns: Paragraph indexing | Color Search | Column Search | Max doc size | O.S.

IXE | no limit | Linux, Windows, Alpha, Solaris
Fulcrum | no | limited | 64 K | Windows, Linux, Solaris
Google | no | ? | limited | 4 K | Linux
Fast | no | ? | ? | ? | NetBSD
SharePoint | no | no | no | ? | Windows

An independent benchmark

[Chart: Indexing (Intel) and Retrieval (Intel), AltaVista vs. IXE; axis from 0 to 250.]

Independent evaluations

Major portal, Germany
Major portal, France
Major portal, Italy
– Stress test with 300 concurrent queries
– Verity crashed in several cases
Microsoft Redmond

IXE in use

Janas
– 150 Million documents
– 50 Million documents per server:
• Pentium III, 1 GHz, 2 GB RAM, 2x75 GB IDE
– Italy: 3 PCs, 300 K queries/day
KataWeb
– largest Italian Web portal
– 4 GB documents
– 2nd largest Italian newspaper

Other Features

Snippets
Document cache
Colors
Multiple collections
– Sorted by page rank
– Authoritativeness
– Popularity
Filter/Group by similarity
Conceptual Clustering

Snippets

Adaptive algorithm:
– Compiled regular expression search for few words
– Karp-Rabin algorithm for several words
Customizable on length of snippets, proximity of hits, etc.
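The Karp-Rabin case can be sketched as follows. This is a minimal illustration under assumptions, not the IXE snippet code: `findWords` and `snippetAround` are hypothetical names, and the variant shown hashes only the first m characters of each query word (m being the shortest word length), then verifies each candidate by direct comparison.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Find every occurrence of any query word in `text` with one rolling-hash pass.
// Returns (offset in text, index of the word) pairs in order of offset.
std::vector<std::pair<std::size_t, std::size_t>>
findWords(const std::string& text, const std::vector<std::string>& words) {
    std::vector<std::pair<std::size_t, std::size_t>> hits;
    if (words.empty()) return hits;
    std::size_t m = words[0].size();  // window length: the shortest word
    for (const std::string& w : words) m = std::min(m, w.size());
    if (m == 0 || text.size() < m) return hits;

    const std::uint64_t B = 257;  // hash base; arithmetic wraps modulo 2^64
    auto hashWindow = [&](const std::string& s, std::size_t start) {
        std::uint64_t h = 0;
        for (std::size_t i = 0; i < m; ++i)
            h = h * B + static_cast<unsigned char>(s[start + i]);
        return h;
    };

    // Hash of the first m characters of each word -> indexes of those words.
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> byHash;
    for (std::size_t k = 0; k < words.size(); ++k)
        byHash[hashWindow(words[k], 0)].push_back(k);

    std::uint64_t high = 1;  // B^(m-1), weight of the character leaving the window
    for (std::size_t i = 1; i < m; ++i) high *= B;

    std::uint64_t h = hashWindow(text, 0);
    for (std::size_t pos = 0;; ++pos) {
        auto it = byHash.find(h);
        if (it != byHash.end())
            for (std::size_t k : it->second)
                if (text.compare(pos, words[k].size(), words[k]) == 0)  // verify
                    hits.emplace_back(pos, k);
        if (pos + m >= text.size()) break;
        // Roll the window one character to the right.
        h = (h - static_cast<unsigned char>(text[pos]) * high) * B
            + static_cast<unsigned char>(text[pos + m]);
    }
    return hits;
}

// Cut a snippet of up to `radius` characters on each side of a hit offset.
std::string snippetAround(const std::string& text, std::size_t offset,
                          std::size_t radius) {
    std::size_t begin = offset > radius ? offset - radius : 0;
    return text.substr(begin, (offset - begin) + radius);
}
```

The `radius` parameter stands in for the customizable snippet length; a real snippet builder would also cut at word boundaries and merge nearby hits.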

Programmable Query Syntax

Typical Search Options
– By document type (e.g. HTML, PDF, DOC)
– By color (e.g. title, author)
– Within site or domain (through prefix search on URL)
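A prefix search over sorted URL keys could look like the sketch below. It assumes, purely for illustration, that URLs are stored with the host name reversed (e.g. com.ideare.www/...), so that every page of a site or domain shares a common prefix; `withPrefix` is a hypothetical helper and the sample keys are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Return all keys of a sorted list that start with `prefix`.
// A binary search finds the first candidate; matching keys are contiguous.
std::vector<std::string> withPrefix(const std::vector<std::string>& sortedKeys,
                                    const std::string& prefix) {
    assert(std::is_sorted(sortedKeys.begin(), sortedKeys.end()));
    auto first = std::lower_bound(sortedKeys.begin(), sortedKeys.end(), prefix);
    auto last = first;
    while (last != sortedKeys.end() &&
           last->compare(0, prefix.size(), prefix) == 0)
        ++last;
    return std::vector<std::string>(first, last);
}
```

With reversed hosts, a site query uses the full host as prefix (com.ideare.www/) and a domain query uses a shorter one (com.ideare.).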

Result Ranking

Based on combination of measures
– Classical IR
– Authoritativeness
– Link popularity
– Prioritized collections
Clients can provide their own criteria
– Pay for placement
– Adult filter
– Freshness, etc.

Ranking Measures

IR rank
– Based on frequencies (tf, idf)
– cosine, Okapi (Robertson), Amati
Best Trec10 score: 0,22% relevance
IXE uses simplified cosine with additional scoring factors:
– Colors (presence in title, heading, etc.)
– Proximity for multiple words
– Capitalization/font possible (Google)
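A simplified cosine score with extra factors might be written as below. The formula, the title boost and the proximity factor are assumptions chosen for illustration; the slides do not give the actual IXE formula or constants.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One query term as matched in one document (hypothetical layout).
struct TermMatch {
    double tf;     // occurrences of the term in the document
    double df;     // number of documents containing the term
    bool inTitle;  // color information: the term occurs in the title
};

const double kTitleBoost = 3.0;  // illustrative constant, not an IXE value

// Simplified cosine: sum of tf * idf over the query terms, normalized by the
// square root of the document length, then scaled by a proximity factor.
// `span` is the size in words of the smallest window holding all query terms.
double score(const std::vector<TermMatch>& terms, double numDocs,
             double docLength, double span) {
    assert(numDocs > 0 && docLength > 0 && span >= 1);
    double sum = 0.0;
    for (const TermMatch& t : terms) {
        double w = t.tf * std::log(numDocs / t.df);  // tf * idf
        if (t.inTitle) w *= kTitleBoost;             // color factor
        sum += w;
    }
    double cosine = sum / std::sqrt(docLength);
    double proximity = 1.0 + 1.0 / span;             // closer terms score higher
    return cosine * proximity;
}
```

The point of the sketch is only the shape of the computation: a word-based part taken from lexicon data, multiplied by factors derived from colors and hit positions.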

Authoritative score

Link popularity
– Based on incoming link count
Reference from authoritative site (e.g. Dmoz)
– Increase document rank
– Descriptions from Dmoz are added to document with special color
Citations (i.e. text surrounding link)
– Added to document with special color

Priority rank

Documents are arranged in several collections
Collections are searched in order
Earlier collections contain higher rank documents
Tunable cutoff at 4000 documents
Statistical estimate of overall number of results
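The prioritized search can be sketched as a loop over collections that stops at the cutoff and extrapolates the total. Everything here is an assumption for illustration: `Collection` holds precomputed matches instead of running a query, and the linear extrapolation is just one possible statistical estimate.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a collection: the matches a query would return, plus its size.
struct Collection {
    std::vector<int> matches;  // ids of matching documents, best first
    std::size_t size;          // total number of documents in the collection
};

struct SearchResult {
    std::vector<int> docs;       // at most `cutoff` documents
    std::size_t estimatedTotal;  // estimated matches over all collections
};

// Search collections in priority order and stop once `cutoff` results exist.
SearchResult prioritySearch(const std::vector<Collection>& collections,
                            std::size_t cutoff) {
    SearchResult out;
    out.estimatedTotal = 0;
    std::size_t total = 0;
    for (const Collection& c : collections) total += c.size;

    std::size_t found = 0, scanned = 0;
    for (const Collection& c : collections) {
        if (out.docs.size() >= cutoff) break;  // later collections are skipped
        for (int d : c.matches)
            if (out.docs.size() < cutoff) out.docs.push_back(d);
        found += c.matches.size();
        scanned += c.size;
    }
    // Extrapolate the hit rate of the searched part to the whole document set.
    if (scanned > 0) out.estimatedTotal = found * total / scanned;
    return out;
}
```

With the cutoff from the slide one would call `prioritySearch(collections, 4000)`; the lower-priority collections are touched only when the earlier ones do not yield enough results.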

Custom rank

IR rank is computed from data in lexicon (word based)
Cosine, authoritativeness, custom rank are document related
Accessing document data during search is a drag in performance
Solution: associate direct access info (mmapped)

Nested Objects

class WebInfo : public DocInfo {
  CompressedText<65535> text;
  RankWeight weights;

  META(WebInfo,
       (SUPERCLASS(DocInfo),
        FIELD(text),
        KEY(weights, mapped)));
};

Custom Rank Nested Object

struct RankWeight
{
  int importance;
  int popularity;
  int freshness;
  int adult;
  …
};
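How such a fixed-size record could be used at query time is sketched below. The four fields come from the slide; the combination formula, its coefficients, and the `customRank` and `rerank` helpers are assumptions. A plain vector indexed by document id stands in for the memory-mapped array of weights.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <vector>

struct RankWeight {
    int importance;
    int popularity;
    int freshness;
    int adult;
};

// A record without pointers or virtual functions can be read in place from a
// memory-mapped file, so no document data has to be fetched during search.
static_assert(std::is_trivially_copyable<RankWeight>::value,
              "RankWeight must be a flat record");

// Combine the word-based IR rank with document-related weights.
// The coefficients are illustrative only.
double customRank(double irRank, const RankWeight& w, bool adultFilter) {
    if (adultFilter && w.adult) return 0.0;
    return irRank * (1.0 + 0.1 * w.importance + 0.05 * w.popularity
                     + 0.01 * w.freshness);
}

struct Scored {
    std::uint32_t doc;
    double rank;
};

// Re-rank hits using direct access by document id into the weight table.
std::vector<Scored> rerank(std::vector<Scored> hits,
                           const std::vector<RankWeight>& table,
                           bool adultFilter) {
    for (Scored& h : hits) {
        assert(h.doc < table.size());
        h.rank = customRank(h.rank, table[h.doc], adultFilter);
    }
    std::stable_sort(hits.begin(), hits.end(),
                     [](const Scored& a, const Scored& b) { return a.rank > b.rank; });
    return hits;
}
```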

Scalability

Distributed Indexing
– Performed on spidering machines
– Merged indexes
Server farm of cheap PCs
– 1.2 GHz Athlon or Pentium
– 2 GB RAM
– 2 x 75 GB disks
12 h indexing cycle for 50 million documents on 8 PCs

Query processing

Query broker
– Dispatches query
– Merge sort of results
– Maintains cache of results
IFL (Local Inverted File Partition)
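The three broker duties can be sketched together. This is an illustration under assumptions: `Broker`, `Partition` and `mergeResults` are hypothetical names, each partition is modeled as a function returning results sorted by decreasing score, and dispatch is sequential here although a real broker would query the partitions in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Result {
    double score;
    int doc;
};

// k-way merge of per-partition lists, each sorted by decreasing score.
std::vector<Result> mergeResults(const std::vector<std::vector<Result>>& parts,
                                 std::size_t k) {
    struct Entry { double score; std::size_t part; std::size_t pos; };
    auto lower = [](const Entry& a, const Entry& b) { return a.score < b.score; };
    std::priority_queue<Entry, std::vector<Entry>, decltype(lower)> heap(lower);
    for (std::size_t p = 0; p < parts.size(); ++p)
        if (!parts[p].empty()) heap.push(Entry{parts[p][0].score, p, 0});

    std::vector<Result> out;
    while (!heap.empty() && out.size() < k) {
        Entry e = heap.top();
        heap.pop();
        out.push_back(parts[e.part][e.pos]);
        if (e.pos + 1 < parts[e.part].size())  // advance within that partition
            heap.push(Entry{parts[e.part][e.pos + 1].score, e.part, e.pos + 1});
    }
    return out;
}

class Broker {
public:
    // A partition answers a query with results sorted by decreasing score.
    using Partition = std::function<std::vector<Result>(const std::string&)>;

    Broker(std::vector<Partition> partitions, std::size_t topK)
        : partitions_(std::move(partitions)), topK_(topK) {}

    const std::vector<Result>& search(const std::string& query) {
        auto it = cache_.find(query);
        if (it != cache_.end()) return it->second;       // served from cache
        std::vector<std::vector<Result>> lists;
        for (Partition& p : partitions_) lists.push_back(p(query));  // dispatch
        return cache_[query] = mergeResults(lists, topK_);           // merge
    }

private:
    std::vector<Partition> partitions_;
    std::size_t topK_;
    std::map<std::string, std::vector<Result>> cache_;
};
```

The cache in this sketch never expires; after an indexing cycle it would have to be cleared.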

Distributed Crawler

High performance
– ~120 pages/sec on single node
Scalable
Fault tolerant
Collects data for link popularity, citations
Handles several document formats

Crawler Architecture

[Diagram: Crawler with Scheduler, Retrievers (select()), Parser, Cache and CrawlInfo; data stores: Table<UrlInfo>, Citations, Hosts, Robots, Host queues.]

Web Service Support

C# integration

Managed code indexer DLL
Managed objects for controlling indexing:
– CollectionInfo
– Gatherer
– Gathered
WebForm GUI

GUI Architecture

[Diagram: GUIControl, CollectionInfo, Gatherer, table<Gathered>.]

Collection Builder

[Diagram: *.coll (Serialize), copy, cache, Gathered records (URL, ID, time, size, MD5, lastSeen), Converter, WebInfo records (name, time, size), Collection Enumerator, table<WebInfo>, WebIndexer (UnManaged).]

Web Search Service

High performance search engine library
C++ template library
Handles Terabyte of data
Available as Web Service

IndeXing Engine
