Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa


Our journey today!

Web search engines

XML search engines

Basic research on data compression, indexing and mining

More than 85% of users arrive at a site from a search engine (SE)

Web searches: 45% Google, 29% Yahoo, 13% MSN, 5% ASK,...
Toolbar searches: 49.6% Google, 46.1% Yahoo,...

SE impact on: Web structure, knowledge and understanding, social behavior... and marketing

33% of users believe that “the results of a query are the best place to buy things”!!

Ads (4B$ in USA, 2B€ in Europe, 180M€ in Italy)
• Paid search: 65% Google, 25% Yahoo, 8% MSN,...
• Portal search: 15% Yahoo, 10% MSN, 7% AOL-Google,...

Much interest...

Retrieve the docs that are “relevant” for the user query

Doc: Word or PDF file, web page, email, blog, e-book,...

Query: “bag of words” paradigm

Relevant ?!?

...We face many difficulties, especially on the Web!!!

Goal of a Search Engine
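The “bag of words” query paradigm mentioned on this slide can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and case folding; the helper names `bag_of_words` and `matches` are hypothetical and not taken from the talk.

```python
from collections import Counter


def bag_of_words(text):
    """Reduce a text to a multiset of lowercase terms (word order is discarded)."""
    return Counter(text.lower().split())


def matches(query, doc):
    """A doc matches when it contains every query term, wherever it occurs."""
    doc_bag = bag_of_words(doc)
    return all(term in doc_bag for term in bag_of_words(query))


# Word order and letter case do not matter in this model.
same_bag = bag_of_words("Nikon CoolPix") == bag_of_words("coolpix nikon")
```

The sketch also shows why “relevant” is hard to pin down: any doc containing the terms matches, regardless of what the user actually meant.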

Web is huge: 8 bil pages [Google]

We need to “rank” the results !!

Languages/Encodings
• Hundreds of languages: 55 (Jul01)
• Home pages:
  In 1997: English 82%, the next 15 take 13%
  In 2001: English 53%, the next 9 take 30%

Distributed authorship
• Millions of people creating pages with their own style…
• Not all have the purest motives in providing high-quality information - commercial motives drive “spamming”.

Web is heterogeneous

Extracting “significant data” is difficult !!

Web is highly dynamic [154 sites, 2004]

A “good” coverage of the indexed Web is difficult !!

[Chart, normalized wrt first week]

User Queries are “difficult”

Query composition:
• Short: 2.54 terms avg (2001), 80% have fewer than 3 terms
• Imprecise terms
• 78% of the queries are not modified

Query results:
• Users are lazy: 85% look at just one page of results

User Needs are “varied”

Informational – want to learn about something (~40%), e.g. Asthma

Navigational – want to go to a page (~25%), e.g. Alitalia

Transactional – want to do something (~35%)
• Access a service, e.g. NY weather
• Downloads, e.g. Mars surface images
• Shop, e.g. Nikon CoolPix

Evolution of Search Engines

First generation -- use only on-page, web-text data
• Word frequency and language

Second generation -- use off-page, web-graph data
• Link (or connectivity) analysis
• Anchor-text (how people refer to a page)

Third generation -- answer “the need behind the query”
• Focus on “user need”, rather than on query
• Integrate multiple data sources
• Click-through data
• Query mining
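Link analysis, the second-generation signal listed above, can be illustrated with a PageRank-style power iteration. The slides do not name a specific algorithm, so this is only a sketch under assumptions: the damping factor 0.85, the toy graph and the function name `pagerank` are chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Score pages by iterating over a dict that maps each page to its out-links."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline score (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # A page without out-links spreads its score over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Toy web graph (hypothetical): both "a" and "b" link to "c", and "c" links back to "a".
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this toy graph the page with two in-links ends up with the highest score, which is the intuition behind using off-page data for ranking.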

1995-1997 AltaVista, Excite, Lycos, etc

1998: Google, now everyone

No winner yet !!

Various players: Google, Yahoo, Msn, Ask,…

Fourth generation: Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]

Yesterday.....

...Today

Yesterday...

...Today

All these tools are built upon a Search Engine

Structure of (Web) Search Engines

Crawler

Page archive

Page Analyzer

Control

Query resolver

Ranker

Indexing data structures

Indexer
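The components named in this diagram can be sketched as a toy pipeline: an indexer builds an inverted index over an archive of pages, a query resolver intersects posting lists, and a ranker orders the results. This is a minimal sketch, assuming whitespace tokenization, AND semantics and term-frequency ranking; none of these choices, nor the function names, come from the slides.

```python
from collections import defaultdict


def build_index(pages):
    """Indexer: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in pages.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def resolve(index, query):
    """Query resolver: return the docs that contain every query term."""
    result = None
    for term in query.lower().split():
        docs = set(index.get(term, {}))
        result = docs if result is None else result & docs
    return result or set()


def rank(index, query, docs):
    """Ranker: order docs by total frequency of the query terms (ties by doc id)."""
    terms = query.lower().split()
    return sorted(docs, key=lambda d: (-sum(index[t][d] for t in terms), d))


# Hypothetical page archive, as a crawler might have stored it.
archive = {
    "p1": "compressed index for xml",
    "p2": "xml search engines and xml mining",
    "p3": "web search engines",
}
index = build_index(archive)
results = rank(index, "xml", resolve(index, "xml"))
```

A real engine replaces each step with far more elaborate machinery, but the data flow between the boxes stays the same.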

Size of search engines [2005]

Google vs Yahoo: 20-30% sharing of results

Ranking: Google vs Yahoo!

Ranking: Google.com - Google.cn

Clustering engines Vivisimo, Snaket,...

Suggestions

Products

Local searches

News, Blogs, ....

Not only Web Searches...

Web search and mining

Yahoo! World Search

Yahoo! Image, Yahoo! Video, Yahoo! Local, Yahoo! News, Yahoo! Shopping Search,

Communication: Yahoo! Mail, Yahoo! Messenger, My Web, Yahoo! Personals, Yahoo! 360º, Yahoo! Photos, Flickr, delicious, ... Yahoo! Answers

Content: Yahoo! Sports, Yahoo! Finance, Yahoo! Music, Yahoo! Movies, Yahoo! News, Yahoo! Games, My Yahoo!

Mobile: Yahoo! Mobile, Yahoo! Go

Commerce: Yahoo! Shopping, Yahoo! Autos, Yahoo! Auctions, Yahoo! Travel,

Small Business: Yahoo! Small Business, Yahoo! Domains, Yahoo! Web Hosting, Yahoo! Merchant Solutions, Yahoo! Business Email, HotJobs

Advertising: Yahoo! Search Marketing, Yahoo! Publisher Network.

[source: R. Baeza-Yates]

Yahoo! numbers [April, ‘06]

15 languages, 20 countries, 6B users

Each day:
• 1 million new accounts
• 3.4 billion page views
• 10 TB of data processed (total, 20 PB)
• 2 billion Mail+Messenger sent

Yahoo! Research Barcelona

• Starting date: May 2006, Barcelona
• Director: Ricardo Baeza-Yates
• Areas: Web Mining and Web Search
• People: more than 10 and… fast growing !!

Why me?
• First academic grant in Europe
• Three-year project on “Data compression and indexing on hierarchical memories”

Data to be mined or searched

Crawled data (large, heterogeneous, …)
• Web Pages & Links
• Blogs
• Items for sale: Shopping, Travel, etc.
• RSS Feeds

Produced data (high quality, sparse,…)
• Yahoo’s Web: YCars, YHealth, YTravel,…
• Edited news, purchased news,…

Direct interaction (quality??)
• Social links
• Tagged content

What is Flickr ?

The wisdom of the crowd can be used to improve the search and extraction process

Observed data

Query Logs: spelling, synonyms, phrases (named entities), substitutions,…

Clicks: relevance, intent, …

“There is a new type of economics that has emerged and that the world doesn't understand,”

“Web usage data is an amazing leading indicator because it tells you where intent is heading”

U. Fayyad, Yahoo Chief Data Officer

Our future goals…

Deploy user actions, e.g. queries + clicks + …

Implicit semantic information

It's free and unbiased

Large volume

… the Semantic Web Hypothesis - Explicit Semantic Information

Obstacle - Us

Possible uses:
• Query suggestion
• Query disambiguation
• Adv suggestions
• Web-site design
...

XML search and mining

An XML excerpt

<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>

The literature on XML indexing...

Various tools are available
• TreSy [Cribecu, 1997]
• eXist [TU Darmstadt, 2002]
• GalaTex [AT&T, 2004]

Some of their limitations
• Run on a single machine
• Use a lot of computational resources (time, space,…)
• Limit the indexable XML document structure

XML document types
• data centric [relational data: DB exports]
• text centric [literary texts, reports, emails, news, …]

Application Level

Our proposal:

Query interface

• XML based

Query solver

• analysis + optimization

Result retriever

• indexing data structure

Data Collection manager

• data compression
• snippet extraction

The first scenario: Client-Server

Context of use : Biblio search,...

The second scenario: Peer-to-Peer

Context of use: Collaborative search

Exploit the power of the crowd
• The largest library of XML tagged text collections

…and the power of search engines
• A suite of search + text mining tools:
  Syntactic text comparison
  Motifs extraction for text pattern identification
  Concept identification via LSI

Our goal...

You find already loaded rare texts, in editions and translations from the 1400s and 1500s

My documents 5

You may compose sophisticated queries

You can visually compose sophisticated structural queries

http://signum.sns.it

Everything at the fingertips of humanists: Nokia 770, Origami (Microsoft), smartphones,...

Stay in touch...

Basic research

Recurrent themes of this talk
• Large volume of data
• Efficient search

Hierarchical memory systems: L1-L2 caches, RAM, (Multi-) Disks, (Web) Network, …

Basic algorithmic tools

Indexing data structures

Data compression

Do we face a paradoxical situation ?

Six years ago... [now, J. ACM 05]

Opportunistic Data Structures with Applications

P. Ferragina, G. Manzini

Survey by Navarro-Mäkinen cites more than 50 papers on the subject !!

[December 2003] [January 2005]

Joint effort with Navarro’s group at Univ. Chile

Some figures over hundreds of MBs of data:
• Count(P) takes a few milliseconds
• Locate(P) takes a few milliseconds for each occurrence of P
• Space is about [bzip ~ 20%]:
  22% (supports just Count ops)
  35% (Count, Locate ops)
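The Count(P) operation quoted above can be illustrated with backward search over the Burrows-Wheeler transform, the mechanism on which the compressed index of the cited J. ACM paper is based. This is a naive, uncompressed sketch: it sorts all rotations and answers rank queries by scanning, so it does not reproduce the space or time figures on the slide. The function names and the `\0` end marker are assumptions made for illustration.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (quadratic, for illustration only)."""
    text += "\0"  # end marker, assumed absent from the text and smaller than any symbol
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text, pattern):
    """Count(P): number of occurrences of pattern, scanning it right to left."""
    # first_row[c] = number of symbols in the text that are smaller than c.
    first_row, total = {}, 0
    for symbol in sorted(set(bwt_text)):
        first_row[symbol] = total
        total += bwt_text.count(symbol)
    lo, hi = 0, len(bwt_text)  # half-open range of sorted rotations
    for symbol in reversed(pattern):
        if symbol not in first_row:
            return 0
        # Each step needs two rank queries; here they are plain scans.
        lo = first_row[symbol] + bwt_text[:lo].count(symbol)
        hi = first_row[symbol] + bwt_text[:hi].count(symbol)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("mississippi")
```

A real compressed index replaces the two scans per step with constant-time rank structures built over the compressed transform, which is where the millisecond figures come from.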

Compressed index for XML [Ferragina et al, WWW ’06]

Query (counting) time 8 ms, Navigation time 3 ms

[Chart: Huffword, XPress, XQzip, XBzipIndex and XBzip on DBLP, Pathways and News]

UniPi is patenting it !!


Thanks !!

An XML excerpt

<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>

It is verbose !

A tree interpretation...

XML document exploration: tree navigation

XML document search: labeled subpath searches

Subset of XPath [W3C]
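Tree navigation and labeled subpath search can be tried on the DBLP excerpt with Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. This is only an illustration of the two query types on an uncompressed tree; the helper name `subpath_search` is hypothetical and the elided part of the excerpt is left out.

```python
import xml.etree.ElementTree as ET

EXCERPT = """
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
</dblp>
"""

root = ET.fromstring(EXCERPT)


def subpath_search(tree_root, path):
    """Return the text of every node reached by a labeled path such as './article/author'."""
    return [node.text.strip() for node in tree_root.findall(path)]


# Tree navigation: list the labels of the root's children.
children = [child.tag for child in root]

# Labeled subpath search: authors that appear below an article node.
article_authors = subpath_search(root, "./article/author")
```

Here the whole tree sits uncompressed in memory, which is exactly the cost the rest of the talk tries to avoid.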

The Problem

We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations:
• Navigational operations
• Subpath and content searches
• Visualization operation

Existing approaches fall short:
• Summary indexes (like Dataguide, 1-index or 2-index): take large space and do not support “content” searches
• XML-aware compressors (like XMill, XmlPpm, ScmPpm,...): need to decompress the whole file
• XML-queriable compressors (like XPress, XGrind, XQzip,...): poor compression and scan of the whole (compressed) file

XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage

A transform for “labeled trees” [Ferragina et al, IEEE FOCS ’05]

We propose the XBW-transform that linearizes a labeled tree T in 2 arrays such that:

• the compression of T reduces to the compression of these two arrays (via e.g. gzip, bzip2, ppm,...)

• the indexing of T reduces to implementing simple rank/select query operations over these two arrays

A = a b a a a c b c d a b e c d ...

Rank(a, 7) = #a in A[1,7] = 4
Select(a, 2) = position of the 2nd a = 3
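The two operations on the array A can be reproduced with a naive linear-time sketch. Positions are 1-based as on the slide; the function names are illustrative, and a real index would answer these queries over a compressed array without scanning it.

```python
def rank(seq, symbol, i):
    """Number of occurrences of symbol in seq[1..i] (1-based, inclusive)."""
    return seq[:i].count(symbol)


def select(seq, symbol, k):
    """1-based position of the k-th occurrence of symbol in seq."""
    position = -1
    for _ in range(k):
        position = seq.find(symbol, position + 1)
        if position < 0:
            raise ValueError("fewer than k occurrences of symbol")
    return position + 1


# The visible prefix of the array A from the slide.
A = "abaaacbcdabecd"
```

With this prefix, `rank(A, "a", 7)` and `select(A, "a", 2)` give the two values shown on the slide.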

The XBW-Transform

Step 1. Visit the tree in pre-order. For each node, write down its label (Sα) and the labels on its upward path (Sπ). The result is a permutation of the tree nodes.

Sα   Sπ (upward labeled path)
C    (empty)
B    C
D    B C
c    D B C
a    D B C
c    B C
A    C
b    A C
a    A C
D    A C
c    D A C
B    C
D    B C
b    D B C
a    B C

The XBW-Transform

Step 2. Stably sort according to Sπ

Step 3. Add a binary array Slast marking the rows corresponding to last children

Slast  Sα   Sπ (upward labeled path)
1      C    (empty)
0      b    A C
0      a    A C
1      D    A C
0      D    B C
1      c    B C
0      D    B C
1      a    B C
0      B    C
0      A    C
1      B    C
1      c    D A C
0      c    D B C
1      a    D B C
1      b    D B C

XBW takes optimal space

XBW can be built and inverted in optimal time
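The three steps can be sketched directly in Python. This is a naive construction that sorts explicit path strings, not the optimal-time algorithm the slide refers to. The tree shape below is an assumption: it was reconstructed from the label and path arrays printed on the slides, and the names `xbw` and `EXAMPLE_TREE` are hypothetical.

```python
def xbw(tree):
    """Return (labels, last_flags) of a labeled tree given as (label, [children])."""
    rows = []

    def visit(node, upward_path, is_last):
        # Step 1: pre-order visit, recording each label with its upward path.
        label, children = node
        rows.append((upward_path, label, is_last))
        for i, child in enumerate(children):
            visit(child, label + upward_path, i == len(children) - 1)

    visit(tree, "", True)
    # Step 2: stable sort by upward path (Python's sort is stable).
    rows.sort(key=lambda row: row[0])
    # Step 3: keep the labels plus one bit marking the last child of each node.
    labels = "".join(row[1] for row in rows)
    last_flags = "".join("1" if row[2] else "0" for row in rows)
    return labels, last_flags


def leaf(label):
    return (label, [])


# Tree reconstructed from the slide example (assumed shape).
EXAMPLE_TREE = ("C", [
    ("B", [("D", [leaf("c"), leaf("a")]), leaf("c")]),
    ("A", [leaf("b"), leaf("a"), ("D", [leaf("c")])]),
    ("B", [("D", [leaf("b")]), leaf("a")]),
])

labels, last_flags = xbw(EXAMPLE_TREE)
```

On this tree the sketch yields the label array `CbaDDcDaBABccab` and the bit array `100101010011011`, matching the arrays shown on the slides.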

An illustrative example

Pcdata

Tags, Attributes and the symbol =

XBW is compressible:
• S and Spcdata are locally homogeneous
• Slast has some structure

XBzip = XBW + PPMd [Ferragina et al, WWW ’06]

[Chart: gzip, bzip2, ppmdi, xmill + ppmdi, scmppm and XBzip on DBLP, Pathways and News]

A general algorithmic paradigm

Basic approach (…now only for text and labelled trees)
• Transform the input data into a few arrays
• Index (+compress) them to support Rank/Select

A lot of interest around it:
• Theory: Soda ’06 (2), Cpm ’06 (2), Icalp ’06 (2), DCC ’06 (1)
• Experimental: Wea ’06 (2)

You can test it: http://pizzachili.di.unipi.it or http://pizzachili.dcc.uchile.cl

