To search or to crawl? Towards a query optimizer for text-centric tasks
Presented by Avinash S Bharadwaj
How can data be extracted from the web?
Execution plans for text-centric tasks follow two general paradigms for processing a text database:
- The entire web can be crawled, scanning each document for the relevant text automatically.
- A search engine index can be used to retrieve the documents of interest via carefully constructed, task-specific queries.
Introduction
Text is ubiquitous, and many applications rely on the text in web pages to perform a variety of tasks.
Examples of text-centric tasks
- Reputation management systems download web pages to track the "buzz" around companies.
- Comparative shopping agents locate e-commerce web sites and add the products offered on those pages to their own index.
Examples of text-centric tasks (contd.)
According to the paper, there are three main types of text-centric tasks:
- Task 1: Information Extraction
- Task 2: Content Summary Construction
- Task 3: Focused Resource Discovery
Task 1: Information Extraction
Information extraction applications extract structured relations from unstructured text. For example, an information extraction system (e.g., NYU's Proteus) can process New York Times articles to build a table of disease outbreaks.

Input text: "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"

Extracted relation (Disease Outbreaks in The New York Times):
Date      | Disease Name    | Location
Jan. 1995 | Malaria         | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia       | U.S.
May 1995  | Ebola           | Zaire

(An information extraction tutorial was given by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan.)
Task 2: Content Summary Construction
- Many text databases have valuable content "hidden" behind search interfaces.
- Metasearchers search over multiple such databases through a unified query interface.
- The task is to generate a content summary for each database, which the metasearcher uses to select the right databases for a query.
Task 3: Focused Resource Discovery
- This task considers building applications around resources on a particular topic.
- The simplest approach is to crawl the entire web and classify the pages accordingly.
- A much more efficient approach is to use a focused crawler.
- Focused crawlers rely on documents and hyperlinks that are on-topic, or likely to lead to on-topic documents, as determined by a number of heuristics.
An Abstract View of Text-Centric Tasks
[Figure: an extraction system pulls documents from a text database and emits output tokens.]
1. Retrieve documents from the database.
2. Process the documents.
3. Extract output tokens.

What a "token" is depends on the task:
Task                   | Token
Information Extraction | Relation tuple
Database Selection     | Word (+ frequency)
Focused Crawling       | Web page about a topic
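The retrieve/process/extract loop above can be sketched in a few lines. This is a minimal illustration with a toy database and extractor; the function names, documents, and the `KNOWN` relation table are all assumptions for the example, not part of the paper.

```python
# A minimal sketch of the abstract retrieve/process/extract loop.
# The toy database, extractor, and names are illustrative assumptions.

def run_task(database, extract, total_tokens, target_recall):
    """Retrieve documents one by one (step 1), process each (step 2),
    and extract output tokens (step 3) until the target recall is met."""
    seen = set()
    retrieved = 0
    for doc in database:
        retrieved += 1
        seen.update(extract(doc))
        if len(seen) / total_tokens >= target_recall:
            break
    return seen, retrieved

# Toy task: the "tokens" are <disease, location> relation tuples.
KNOWN = {"Ebola": "Zaire", "Malaria": "Ethiopia", "SARS": "China"}

def extract(doc):
    return {(d, loc) for d, loc in KNOWN.items() if d in doc}

docs = [
    "Ebola outbreak reported in Zaire",
    "Malaria cases rising in Ethiopia",
    "Stock markets rallied today",
    "SARS detected in China",
]
tokens, cost = run_task(docs, extract, total_tokens=3, target_recall=0.66)
```

With these toy inputs the loop stops after two documents, once two of the three known tokens (a recall of about 0.67) have been extracted.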
Execution Strategies
The paper describes four execution strategies:
- Crawl-based: Scan, Filtered Scan
- Query- (index-) based: Iterative Set Expansion, Automatic Query Generation
Execution Strategy: Scan
- Scan processes each document in the database exhaustively until the number of extracted tokens satisfies the target recall.
- Scan needs no training and sends no queries to the database.
- Execution time = |Retrieved Docs| · (R + P), where R is the time to retrieve a document and P is the time to process a document.
- Prioritizing the order in which documents are retrieved may improve efficiency.
Execution Strategy: Filtered Scan
- Filtered Scan improves on the basic Scan strategy.
- Unlike Scan, Filtered Scan uses a task-specific classifier to check whether a document is likely to contribute at least one token before processing it.
- Execution time = |Retrieved Docs| · (R + P + C), where R is the time to retrieve a document, P the time to process it, and C the time to classify it.
Execution Strategy: Iterative Set Expansion
[Figure: the extraction system queries the text database with seed tokens and feeds newly extracted tokens back into query generation.]
1. Query the database with seed tokens (e.g., [Ebola AND Zaire]).
2. Process the retrieved documents.
3. Extract tokens from the documents (e.g., <Malaria, Ethiopia>).
4. Augment the seed tokens with the new tokens, and repeat.
Execution Strategy: Iterative Set Expansion (contd.)
- Iterative Set Expansion has been successfully applied to many tasks.
- Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time to retrieve a document, P the time to process it, and Q the time for the database to answer a query.
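The per-strategy cost formulas above can be collected into a small sketch. The parameters R, P, C, and Q follow the slides (times to retrieve, process, and classify one document, and to answer one query); the function names are illustrative.

```python
# Sketches of the per-strategy cost formulas from the slides.

def scan_time(num_docs, R, P):
    """Scan: retrieve and process every document."""
    return num_docs * (R + P)

def filtered_scan_time(num_docs, R, P, C):
    """Filtered Scan: Scan plus a classification step per document."""
    return num_docs * (R + P + C)

def iterative_set_expansion_time(num_docs, num_queries, R, P, Q):
    """Iterative Set Expansion: document costs plus query-answering costs."""
    return num_docs * (R + P) + num_queries * Q
```

For example, with R = 0.5 s and P = 1.0 s, scanning 100 documents costs 150 s, while retrieving 50 documents via 10 queries at Q = 0.3 s costs 78 s.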
Execution Strategy: Automatic Query Generation
- Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation.
- Automatic Query Generation avoids this problem by creating queries offline (using machine learning) that are designed to return documents rich in tokens.
- Automatic Query Generation works in two stages:
  1. Training stage: train a classifier to categorize documents as useful or not for the task.
  2. Execution stage: search the database using queries that are expected to retrieve useful documents.
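The training stage can be sketched as follows: score words by how strongly they separate useful from useless training documents, and turn the top-scoring words into queries. The smoothed log-odds scoring below is an illustrative stand-in for the machine-learned classifier the paper describes; the documents and names are toy assumptions.

```python
import math
from collections import Counter

# A minimal sketch of Automatic Query Generation's training stage.

def generate_queries(useful_docs, useless_docs, k=1):
    # Count, per word, how many useful/useless documents contain it.
    pos = Counter(w for d in useful_docs for w in set(d.lower().split()))
    neg = Counter(w for d in useless_docs for w in set(d.lower().split()))

    def score(word):  # smoothed log-odds of indicating a useful document
        p = (pos[word] + 1) / (len(useful_docs) + 2)
        q = (neg[word] + 1) / (len(useless_docs) + 2)
        return math.log(p / q)

    vocab = set(pos) | set(neg)
    return sorted(vocab, key=score, reverse=True)[:k]

useful = ["Ebola outbreak in Zaire", "Malaria outbreak in Ethiopia"]
useless = ["Stock markets fell in London", "Election results in Paris"]
queries = generate_queries(useful, useless, k=1)
```

On this toy data the word "outbreak" scores highest, since it appears in every useful document and no useless one, so it becomes the query sent in the execution stage.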
Estimating Execution Plan Costs: Scan
Modeling Scan for a token t: what is the probability of seeing t (which occurs in g(t) of the documents) after retrieving S documents?
- Scan is a "sampling without replacement" process over the N documents in the database.
- After retrieving S documents, the frequency of token t in the sample follows a hypergeometric distribution.
- The recall for token t is the probability that the frequency of t in the S retrieved documents is greater than zero.
[Figure: token t (e.g., <SARS, China>) being sampled from a database D of N documents d1, d2, …, dS, …, dN.]
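Under this model, the probability of missing t entirely is the chance that all S retrieved documents come from the N − g(t) documents that do not contain t, i.e. C(N − g(t), S) / C(N, S); the recall for t is one minus that. A direct sketch (function and parameter names are illustrative):

```python
from math import comb

# Recall for token t under the hypergeometric ("sampling without
# replacement") model: N documents in total, g of them contain t,
# and S documents are retrieved.

def recall_for_token(N, g, S):
    if S > N - g:  # more draws than t-free documents: t must appear
        return 1.0
    return 1 - comb(N - g, S) / comb(N, S)
```

For instance, a token occurring in 1 of 10 documents has a recall of exactly 0.5 after retrieving 5 documents.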
Estimating Execution Plan Costs: Iterative Set Expansion
- The querying graph is a bipartite graph containing tokens and documents.
- Each token (transformed into a keyword query) retrieves documents.
- Documents contain tokens.
[Figure: a bipartite querying graph with tokens t1–t5 (e.g., <SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) on one side and documents d1–d5 on the other.]
Estimating Execution Plan Costs: Iterative Set Expansion (contd.)
We need to compute:
- The number of documents retrieved after sending Q tokens as queries (estimates execution time).
- The number of tokens that appear in the retrieved documents (estimates recall).
To estimate these, we need:
- The degree distribution of the tokens discovered by retrieving documents.
- The degree distribution of the documents retrieved by the tokens.
(These are not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)
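Both degree distributions can be computed from a bipartite querying graph represented as a mapping from each token to the documents its query retrieves. The toy graph and all names below are illustrative assumptions.

```python
from collections import Counter

# A sketch of the two degree distributions over a toy bipartite
# querying graph.

def degree_distributions(token_to_docs):
    # Token degrees: how many tokens retrieve 1 doc, 2 docs, ...
    token_degrees = Counter(len(docs) for docs in token_to_docs.values())
    # Invert the graph to count, per document, how many tokens it contains.
    doc_token_counts = Counter()
    for docs in token_to_docs.values():
        for d in docs:
            doc_token_counts[d] += 1
    doc_degrees = Counter(doc_token_counts.values())
    return token_degrees, doc_degrees

graph = {
    "<SARS, China>": {"d1"},
    "<Ebola, Zaire>": {"d1", "d2"},
    "<Malaria, Ethiopia>": {"d2"},
}
token_deg, doc_deg = degree_distributions(graph)
```

Here two tokens retrieve one document each and one token retrieves two, while both documents contain two tokens each.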
Experimental Results
[Figure: two plots of execution time (seconds, log scale from 100 to 100,000) versus recall (0.0 to 1.0) for Scan, Filtered Scan, Iterative Set Expansion, and Automatic Query Generation; the second plot also shows the OPTIMIZED plan chosen by the optimizer.]
Experimental Results (contd.)
[Figure omitted.]
Conclusions
- Common execution plans for multiple text-centric tasks.
- Analytic models for predicting the execution time and recall of various crawl- and query-based plans.
- Techniques for on-the-fly parameter estimation.
- An optimization framework that picks, on the fly, the fastest plan for a target recall.

Thank you!