Quality of a search engine Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 8


Is it good?

How fast does it index? Number of documents/hour (for a given average document size)

How fast does it search? Latency as a function of index size

Expressiveness of the query language

Measures for a search engine

All of the preceding criteria are measurable

The key measure: user happiness…useless answers won’t make a user happy

Happiness: elusive to measure

The commonest approach is to measure the relevance of search results. How do we measure it?

Requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair

Evaluating an IR system

Standard benchmarks

TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years

Other doc collections: marked by human experts, for each query and for each doc, as Relevant or Irrelevant

On the Web everything is more complicated since we cannot mark the entire corpus !!

General scenario

[Figure: the Relevant and Retrieved document sets, overlapping inside the collection]

Precision vs. Recall

Precision: % of retrieved docs that are relevant [issue: “junk” found]

Recall: % of relevant docs that are retrieved [issue: “info” found]

How to compute them

Precision: fraction of retrieved docs that are relevant
Recall: fraction of relevant docs that are retrieved

Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)

                Relevant              Not Relevant
Retrieved       tp (true positive)    fp (false positive)
Not Retrieved   fn (false negative)   tn (true negative)
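The two formulas above can be sketched in a few lines of Python. This is a minimal illustration for a single query, assuming the retrieved and relevant documents are given as sets of ids; the function name and the doc ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant docs that were retrieved
    fp = len(retrieved - relevant)   # retrieved docs that are not relevant
    fn = len(relevant - retrieved)   # relevant docs that were missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 docs retrieved, 3 docs relevant, 2 in common.
p, r = precision_recall(retrieved={"d1", "d2", "d3", "d4"},
                        relevant={"d2", "d4", "d7"})
print(p, r)
```

Note that tn never enters either formula, which is why these measures stay meaningful even when almost all of the collection is irrelevant to the query.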

Some considerations

Can get high recall (but low precision) by retrieving all docs for all queries!

Recall is a non-decreasing function of the number of docs retrieved

Precision usually decreases as more docs are retrieved
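A small sketch can make these considerations concrete: it walks down a ranked result list and records precision and recall after each retrieved doc. The ranking and the relevance judgments are invented for illustration, and the function assumes at least one relevant doc exists.

```python
def pr_at_each_cutoff(ranking, relevant):
    """Return a list of (k, precision@k, recall@k) for k = 1..len(ranking).

    Assumes `relevant` is non-empty (otherwise recall is undefined).
    """
    relevant = set(relevant)
    points, hits = [], 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((k, hits / k, hits / len(relevant)))
    return points

# Hypothetical ranked answer list and relevance judgments.
ranking = ["d3", "d9", "d1", "d4", "d7"]
relevant = {"d3", "d1", "d7"}
points = pr_at_each_cutoff(ranking, relevant)
for k, p, r in points:
    print(k, round(p, 2), round(r, 2))
```

Recall can only stay flat or grow as k increases, while precision moves up and down and tends to drop as irrelevant docs accumulate.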

Precision-Recall curve

We measure Precision at various levels of Recall. Note: it is an AVERAGE over many queries

[Figure: plot of precision (y-axis) against recall (x-axis), with four sample points]

A common picture

[Figure: plot of precision (y-axis) against recall (x-axis), with four sample points]

F measure

Combined measure (weighted harmonic mean):

$$F = \frac{1}{\alpha \frac{1}{P} + (1-\alpha) \frac{1}{R}}$$

People usually use the balanced F1 measure, i.e., with α = ½, thus 1/F = ½ (1/P + 1/R)

Use this if you need to optimize a single measure that balances precision and recall.
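As a minimal sketch, the weighted harmonic mean can be computed as follows. The function name is hypothetical, and returning 0 when P or R is 0 is a convention assumed here to avoid a division by zero.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall (alpha = 1/2 gives F1)."""
    if precision == 0 or recall == 0:
        return 0.0  # assumed convention: F is 0 if either measure is 0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

# Hypothetical system: half of the retrieved docs are relevant, nothing is missed.
f1 = f_measure(0.5, 1.0)
print(f1)
```

With P = 0.5 and R = 1 the arithmetic mean would be 0.75, while the harmonic mean is pulled toward the lower of the two values, so a system cannot score well by maximizing only one measure.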

Recommendation systems


Recommendations

We have a list of restaurants, with positive and negative ratings for some of them

Which restaurant(s) should I recommend to Dave?

Restaurants: Brahma Bull Spaghetti House Mango Il Fornaio Zao Ming's Ramona's Straits Homma's
Alice: Yes No Yes No
Bob: Yes No No
Cindy: Yes No No
Dave: No No Yes Yes Yes
Estie: No Yes Yes Yes
Fred: No No

Basic Algorithm

Recommend the most popular restaurants, say by # positive votes minus # negative votes

What if Dave does not like Spaghetti?

Restaurants: Brahma Bull Spaghetti House Mango Il Fornaio Zao Ming's Ramona's Straits Homma's
Alice: 1 -1 1 -1
Bob: 1 -1 -1
Cindy: 1 -1 -1
Dave: -1 -1 1 1 1
Estie: -1 1 1 1
Fred: -1 -1
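The basic algorithm above can be sketched as a popularity count. The votes below are hypothetical (+1 = liked, -1 = disliked) and are not the table from the slides; the function names are also made up for the example.

```python
from collections import Counter

# Hypothetical votes, NOT the ratings table from the slides.
ratings = {
    "Alice": {"Mango": 1, "Il Fornaio": -1, "Straits": 1},
    "Bob":   {"Mango": 1, "Ming's": -1},
    "Cindy": {"Il Fornaio": -1, "Straits": 1, "Ming's": -1},
    "Dave":  {"Il Fornaio": 1},
}

def popularity(ratings):
    """Score each restaurant as (# positive votes) - (# negative votes)."""
    score = Counter()
    for votes in ratings.values():
        for restaurant, vote in votes.items():
            score[restaurant] += vote
    return score

def recommend_popular(ratings, user, n=1):
    """Return the n most popular restaurants the user has not rated yet."""
    seen = ratings.get(user, {})
    ranked = sorted(popularity(ratings).items(), key=lambda kv: kv[1], reverse=True)
    return [r for r, _ in ranked if r not in seen][:n]

top = recommend_popular(ratings, "Dave", n=2)
print(top)
```

The score is the same for every user, so Dave's own tastes play no role; this is exactly the weakness raised by the question on the slide.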

Smart Algorithm

Basic idea: find the person “most similar” to Dave according to cosine-similarity (i.e. Estie), and then recommend something this person likes.

Perhaps recommend Straits Cafe to Dave

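A minimal sketch of the smart algorithm: compute the cosine similarity between rating vectors (missing ratings count as 0), pick the most similar person, and suggest what that person likes. The ratings below are hypothetical and chosen only so that Estie comes out closest to Dave, as in the slide; they are not the slide's actual table.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    shared = set(u) & set(v)
    dot = sum(u[r] * v[r] for r in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend_from_nearest(ratings, user):
    """Find the most similar other user; return them and their unseen likes."""
    others = {name: r for name, r in ratings.items() if name != user}
    nearest = max(others, key=lambda name: cosine(ratings[user], others[name]))
    liked = [r for r, vote in ratings[nearest].items()
             if vote > 0 and r not in ratings[user]]
    return nearest, liked

# Hypothetical ratings (+1 = liked, -1 = disliked).
ratings = {
    "Alice": {"Brahma Bull": 1, "Mango": -1, "Il Fornaio": 1},
    "Dave":  {"Brahma Bull": -1, "Mango": 1, "Ming's": 1},
    "Estie": {"Brahma Bull": -1, "Mango": 1, "Ming's": 1, "Straits": 1},
}
nearest, suggestions = recommend_from_nearest(ratings, "Dave")
print(nearest, suggestions)
```

The recommendation depends entirely on a single neighbour, which is the limitation the next question points at.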

Do you want to rely on one person’s opinions?

Main idea

[Figure: users U, V, W, Y and items d1-d7]

What do we suggest to U?

A glimpse of XML retrieval (eXtensible Markup Language)


Reading 10

XML vs HTML

HTML is a markup language for a specific purpose (display in browsers)

XML is a framework for defining markup languages

HTML has fixed markup tags, XML does not

HTML can be formalized as an XML language (XHTML)

XML Example (visual)

XML Example (textual)

<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
</chapter>

Basic Structure

An XML doc is an ordered, labeled tree

character data: leaf nodes contain the actual data (text strings)

element nodes: each is labeled with a name (often called the element type) and a set of attributes, each consisting of a name and a value; element nodes can have child nodes
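The tree view can be explored with Python's standard `xml.etree.ElementTree` module. This is only a sketch on the small chapter example; the helper `outline` is a hypothetical name. One caveat: ElementTree does not create separate leaf nodes for character data, it stores text in the `.text` and `.tail` fields of element nodes.

```python
import xml.etree.ElementTree as ET

doc = """<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the
    <tm>FileCab</tm>inet application.</para>
</chapter>"""

def outline(elem, depth=0):
    """Return one indented line per element node, in document order."""
    lines = ["  " * depth + f"{elem.tag} {elem.attrib}"]
    for child in elem:               # children are kept in their original order
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(doc)
tree_lines = outline(root)
print("\n".join(tree_lines))

# Character data following </tm> is stored as the "tail" of the tm element.
tm = root.find("para/tm")
print(repr(tm.text), repr(tm.tail))
```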

XML: Design Goals

Separate syntax from semantics to provide a framework for structuring information

Allow tailor-made markup for any imaginable application domain

Support internationalization (Unicode) and platform independence

Be the standard of (semi)structured information (do some of the work now done by databases)

Why Use XML?

Represent semi-structured data

XML is more flexible than DBs

XML is more structured than simple IR

You get a massive infrastructure for free

Data vs. Text-centric XML

Data-centric XML: used for messaging between enterprise applications
Mainly a recasting of relational data

Text-centric XML: used for annotating content
Rich in text
Demands good integration of text retrieval functionality
E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

IR Challenges in XML

There is no document unit in XML

How do we compute tf and idf?

Indexing granularity

Need to go to document for retrieving or displaying a fragment
E.g., give me the Abstracts of Papers on existentialism

Need to identify similar elements in different schemas
Example: employee

XQuery: SQL for XML?

Simple attribute/value
/play/title contains “hamlet”

Path queries
title contains “hamlet”
/play//title contains “hamlet”

Complex graphs
Employees with two managers

What about relevance ranking?
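The difference between the child path and the descendant path can be tried out with Python's `xml.etree.ElementTree`, which supports a small XPath subset (it is not an XQuery engine). The play document and the helper `titles_containing` are hypothetical, and "contains" is approximated here by a case-insensitive substring test.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, used only to illustrate the two path queries.
play = ET.fromstring("""<play>
  <title>Hamlet</title>
  <act><scene><title>Elsinore. A platform before the castle.</title></scene></act>
</play>""")

def titles_containing(root, path, word):
    """Return the text of elements on `path` whose text contains `word`."""
    return [t.text for t in root.findall(path)
            if word.lower() in (t.text or "").lower()]

direct = titles_containing(play, "./title", "hamlet")     # like /play/title
anywhere = titles_containing(play, ".//title", "castle")  # like /play//title
print(direct)
print(anywhere)
```

Both queries return an unordered yes/no match set, which is why the question of relevance ranking remains open.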

Data structures for XML retrieval

Inverted index: give me all elements matching text query Q
We know how to do this: treat each element as a document

Give me all elements below any instance of the Book element (Parent/child relationship is not enough)
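A minimal sketch of "treat each element as a document": every element gets an id (its position in document order), and its text, including the text of its descendants, is indexed under that id. The book document, the tokenizer, and the function name are assumptions made for the example; a real index would also store positions and be compressed.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

def index_elements(root):
    """Map each term to the ids of the elements whose subtree text contains it."""
    elements = list(root.iter())            # document order; list index = element id
    index = defaultdict(set)
    for elem_id, elem in enumerate(elements):
        text = " ".join(elem.itertext()).lower()
        for term in re.findall(r"\w+", text):
            index[term].add(elem_id)
    return elements, index

# Hypothetical document.
book = ET.fromstring(
    "<book><title>Cocoa</title>"
    "<chapter><para>cocoa production in Ghana</para></chapter>"
    "<chapter><para>shipping and trade</para></chapter></book>"
)
elements, index = index_elements(book)
matches = [elements[i].tag for i in sorted(index["production"])]
print(matches)
```

Because ancestors contain the text of their descendants, the same occurrence is indexed several times; this redundancy is one face of the indexing granularity problem.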

Positional containment

Doc: 1

Play: 27 1122 2033 5790
Verse: 431 867
Term: droppeth 720

droppeth under Verse under Play.

Containment can be viewed as merging postings.
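The containment check can be sketched as a merge of sorted postings. The sketch assumes that the numbers listed for Play and Verse are (start, end) position pairs of element instances in doc 1, that 720 is a position of the term droppeth, and that extents of the same element type do not overlap; the function names are hypothetical.

```python
def extents_inside(inner, outer):
    """Keep the inner (start, end) extents nested in some outer extent.

    Both lists must be sorted by start; one pointer scans `outer` once.
    """
    out, i = [], 0
    for s, e in inner:
        while i < len(outer) and outer[i][1] < s:
            i += 1                           # this outer extent ends before s
        if i < len(outer) and outer[i][0] <= s and e <= outer[i][1]:
            out.append((s, e))
    return out

def positions_inside(positions, extents):
    """Keep the sorted term positions that fall inside some extent."""
    out, i = [], 0
    for p in positions:
        while i < len(extents) and extents[i][1] < p:
            i += 1
        if i < len(extents) and extents[i][0] <= p:
            out.append(p)
    return out

play = [(27, 1122), (2033, 5790)]    # assumed (start, end) pairs
verse = [(431, 867)]
droppeth = [720]

verses_in_play = extents_inside(verse, play)
hits = positions_inside(droppeth, verses_in_play)
print(hits)
```

Position 720 lies in the Verse extent 431-867, which in turn lies in the Play extent 27-1122, so the query "droppeth under Verse under Play" matches.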

Summary of data structures

Path containment etc. can essentially be solved by positional inverted indexes

Retrieval consists of “merging” postings

All the compression tricks are still applicable

Complications arise from insertion/deletion of elements and of text within elements (beyond the scope of this course)

Search Engines

Advertising

Classic approach…

Socio-demographic, Geographic, Contextual

Search Engines vs Advertisement

First generation -- use only on-page, web-text data
Word frequency and language

Second generation -- use off-page, web-graph data
Link (or connectivity) analysis
Anchor-text (How people refer to a page)

Third generation -- answer “the need behind the query”
Focus on “user need”, rather than on query
Integrate multiple data-sources
Click-through data

Pure search vs Paid search

Ads show on search (who pays more), Goto/Overture

2003: Google/Yahoo. New model

All players now have: SE, Adv platform + network

The new scenario

SEs make possible: aggregation of interests, unlimited selection (Amazon, Netflix, ...)

Incentives for specialized niche players

The biggest money is in the smallest sales !!

Two new approaches

Sponsored search (AdWords): Ads driven by search keywords (and by the profile of the user issuing them)

Context match (AdSense): Ads driven by the content of a web page (and by the profile of the user reaching that page)

How does it work ?

1) Match Ads to query or page content
2) Order the Ads
3) Pricing on a click-through

Fields involved: IR, Econ
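The three steps can be put together in a toy sketch. Everything in it is an assumption made for illustration: matching by keyword overlap, ordering by bid times estimated click-through rate, and charging the advertiser's own bid on a click. Real ad platforms use more refined matching and pricing rules, which are not described here.

```python
from dataclasses import dataclass

@dataclass
class Ad:
    name: str
    keywords: set
    bid: float   # hypothetical: what the advertiser offers to pay per click
    ctr: float   # hypothetical: click-through rate estimated from past clicks

def select_ads(query, ads, slots=2):
    """Steps 1 and 2: match ads to the query, then order them."""
    terms = set(query.lower().split())
    matched = [ad for ad in ads if ad.keywords & terms]        # 1) match
    ranked = sorted(matched, key=lambda ad: ad.bid * ad.ctr,   # 2) order
                    reverse=True)
    return ranked[:slots]

def charge_on_click(shown, clicked_name):
    """Step 3: the advertiser pays only when its ad is clicked."""
    return sum(ad.bid for ad in shown if ad.name == clicked_name)

ads = [
    Ad("A", {"pizza"}, bid=1.0, ctr=0.02),
    Ad("B", {"pizza", "pasta"}, bid=0.5, ctr=0.10),
    Ad("C", {"shoes"}, bid=2.0, ctr=0.05),
]
shown = select_ads("pizza pisa", ads)
print([ad.name for ad in shown], charge_on_click(shown, "B"))
```

In this toy example ad B outranks ad A despite the lower bid, because its higher estimated click-through rate gives it a larger expected payment per impression.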

Web usage data !!!

Visited Pages, Clicked Banner, Web Searches, Clicks on Search Results

Dictionary problem

A new game

For advertisers:
What words to buy, how much to pay
SPAM is an economic activity

For search engine owners:
How to price the words
Find the right Ad
Keyword suggestion, geo-coding, business control, language restriction, proper Ad display

Similar to web searching, but: Ad-DB is smaller, Ad-items are small pages, ranking depends on clicks
