To search or to crawl? Towards a query optimizer for text-centric tasks
Presented by Avinash S Bharadwaj
How can data be extracted from the web?
Execution plans for text-centric tasks follow two general paradigms for processing a text database:
- The entire web can be crawled, scanning each document for the relevant text automatically.
- A search engine index can be used to retrieve the documents of interest via carefully constructed, task-specific queries.
Introduction
Text is ubiquitous, and many applications rely on the text in web pages to perform a variety of tasks.
Examples of text-centric tasks
- Reputation management systems download web pages to track the "buzz" around companies.
- Comparative shopping agents locate e-commerce web sites and add the products offered on those pages to their own index.
Examples of text-centric tasks (contd.)
According to the paper, there are three main types of text-centric tasks:
- Task 1: Information Extraction
- Task 2: Content Summary Construction
- Task 3: Focused Resource Discovery
Task 1: Information Extraction
Information extraction applications extract structured relations from unstructured text. For example, an information extraction system (e.g., NYU's Proteus) can process New York Times articles to build a table of disease outbreaks.

Input text: "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"

Extracted relation (Disease Outbreaks in The New York Times):
Date      | Disease Name    | Location
Jan. 1995 | Malaria         | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia       | U.S.
May 1995  | Ebola           | Zaire

(An information extraction tutorial was given by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan.)
Task 2: Content Summary Construction
- Many text databases have valuable content "hidden" behind search interfaces.
- Metasearchers search over multiple such databases through a unified query interface.
- The task is to generate a content summary for each database, which the metasearcher uses to select the right databases for a query.
Task 3: Focused Resource Discovery
- This task considers building applications around resources on a particular topic.
- The simplest approach is to crawl the entire web and classify the pages accordingly.
- A much more efficient approach is to use a focused crawler.
- Focused crawlers rely on documents and hyperlinks that are on-topic, or likely to lead to on-topic documents, as determined by a number of heuristics.
An Abstract View of Text-Centric Tasks
[Figure: an extraction system pulls documents from a text database and emits output tokens.]
1. Retrieve documents from the database.
2. Process the documents.
3. Extract output tokens.

What a "token" is depends on the task:
Task                   | Token
Information Extraction | Relation tuple
Database Selection     | Word (+ frequency)
Focused Crawling       | Web page about a topic
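The retrieve/process/extract loop above can be sketched in a few lines. This is a minimal illustration with a toy database and extractor; the function names, documents, and the `KNOWN` relation table are all assumptions for the example, not part of the paper.

```python
# A minimal sketch of the abstract retrieve/process/extract loop.
# The toy database, extractor, and names are illustrative assumptions.

def run_task(database, extract, total_tokens, target_recall):
    """Retrieve documents one by one (step 1), process each (step 2),
    and extract output tokens (step 3) until the target recall is met."""
    seen = set()
    retrieved = 0
    for doc in database:
        retrieved += 1
        seen.update(extract(doc))
        if len(seen) / total_tokens >= target_recall:
            break
    return seen, retrieved

# Toy task: the "tokens" are <disease, location> relation tuples.
KNOWN = {"Ebola": "Zaire", "Malaria": "Ethiopia", "SARS": "China"}

def extract(doc):
    return {(d, loc) for d, loc in KNOWN.items() if d in doc}

docs = [
    "Ebola outbreak reported in Zaire",
    "Malaria cases rising in Ethiopia",
    "Stock markets rallied today",
    "SARS detected in China",
]
tokens, cost = run_task(docs, extract, total_tokens=3, target_recall=0.66)
```

With these toy inputs the loop stops after two documents, once two of the three known tokens (a recall of about 0.67) have been extracted.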
Execution Strategies
The paper describes four execution strategies:
- Crawl-based: Scan, Filtered Scan
- Query- (index-) based: Iterative Set Expansion, Automatic Query Generation
Execution Strategy: Scan
- Scan processes each document in the database exhaustively until the number of extracted tokens satisfies the target recall.
- Scan needs no training and sends no queries to the database.
- Execution time = |Retrieved Docs| · (R + P), where R is the time to retrieve a document and P is the time to process a document.
- Prioritizing the order in which documents are retrieved may improve efficiency.
Execution Strategy: Filtered Scan
- Filtered Scan improves on the basic Scan strategy.
- Unlike Scan, Filtered Scan uses a task-specific classifier to check whether a document is likely to contribute at least one token before processing it.
- Execution time = |Retrieved Docs| · (R + P + C), where R is the time to retrieve a document, P the time to process it, and C the time to classify it.
Execution Strategy: Iterative Set Expansion
[Figure: the extraction system queries the text database with seed tokens and feeds newly extracted tokens back into query generation.]
1. Query the database with seed tokens (e.g., [Ebola AND Zaire]).
2. Process the retrieved documents.
3. Extract tokens from the documents (e.g., <Malaria, Ethiopia>).
4. Augment the seed tokens with the new tokens, and repeat.
Execution Strategy: Iterative Set Expansion (contd.)
- Iterative Set Expansion has been successfully applied to many tasks.
- Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time to retrieve a document, P the time to process it, and Q the time for the database to answer a query.
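The per-strategy cost formulas above can be collected into a small sketch. The parameters R, P, C, and Q follow the slides (times to retrieve, process, and classify one document, and to answer one query); the function names are illustrative.

```python
# Sketches of the per-strategy cost formulas from the slides.

def scan_time(num_docs, R, P):
    """Scan: retrieve and process every document."""
    return num_docs * (R + P)

def filtered_scan_time(num_docs, R, P, C):
    """Filtered Scan: Scan plus a classification step per document."""
    return num_docs * (R + P + C)

def iterative_set_expansion_time(num_docs, num_queries, R, P, Q):
    """Iterative Set Expansion: document costs plus query-answering costs."""
    return num_docs * (R + P) + num_queries * Q
```

For example, with R = 0.5 s and P = 1.0 s, scanning 100 documents costs 150 s, while retrieving 50 documents via 10 queries at Q = 0.3 s costs 78 s.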
Execution Strategy: Automatic Query Generation
- Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation.
- Automatic Query Generation avoids this problem by creating queries offline (using machine learning) that are designed to return documents rich in tokens.
- Automatic Query Generation works in two stages:
  1. Training stage: train a classifier to categorize documents as useful or not for the task.
  2. Execution stage: search the database using queries that are expected to retrieve useful documents.
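The training stage can be sketched as follows: score words by how strongly they separate useful from useless training documents, and turn the top-scoring words into queries. The smoothed log-odds scoring below is an illustrative stand-in for the machine-learned classifier the paper describes; the documents and names are toy assumptions.

```python
import math
from collections import Counter

# A minimal sketch of Automatic Query Generation's training stage.

def generate_queries(useful_docs, useless_docs, k=1):
    # Count, per word, how many useful/useless documents contain it.
    pos = Counter(w for d in useful_docs for w in set(d.lower().split()))
    neg = Counter(w for d in useless_docs for w in set(d.lower().split()))

    def score(word):  # smoothed log-odds of indicating a useful document
        p = (pos[word] + 1) / (len(useful_docs) + 2)
        q = (neg[word] + 1) / (len(useless_docs) + 2)
        return math.log(p / q)

    vocab = set(pos) | set(neg)
    return sorted(vocab, key=score, reverse=True)[:k]

useful = ["Ebola outbreak in Zaire", "Malaria outbreak in Ethiopia"]
useless = ["Stock markets fell in London", "Election results in Paris"]
queries = generate_queries(useful, useless, k=1)
```

On this toy data the word "outbreak" scores highest, since it appears in every useful document and no useless one, so it becomes the query sent in the execution stage.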
Estimating Execution Plan Costs: Scan
Modeling Scan for a token t: what is the probability of seeing t (which occurs in g(t) of the documents) after retrieving S documents?
- Scan is a "sampling without replacement" process over the N documents in the database.
- After retrieving S documents, the frequency of token t in the sample follows a hypergeometric distribution.
- The recall for token t is the probability that the frequency of t in the S retrieved documents is greater than zero.
[Figure: token t (e.g., <SARS, China>) being sampled from a database D of N documents d1, d2, …, dS, …, dN.]
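Under this model, the probability of missing t entirely is the chance that all S retrieved documents come from the N − g(t) documents that do not contain t, i.e. C(N − g(t), S) / C(N, S); the recall for t is one minus that. A direct sketch (function and parameter names are illustrative):

```python
from math import comb

# Recall for token t under the hypergeometric ("sampling without
# replacement") model: N documents in total, g of them contain t,
# and S documents are retrieved.

def recall_for_token(N, g, S):
    if S > N - g:  # more draws than t-free documents: t must appear
        return 1.0
    return 1 - comb(N - g, S) / comb(N, S)
```

For instance, a token occurring in 1 of 10 documents has a recall of exactly 0.5 after retrieving 5 documents.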
Estimating Execution Plan Costs: Iterative Set Expansion
- The querying graph is a bipartite graph containing tokens and documents.
- Each token (transformed into a keyword query) retrieves documents.
- Documents contain tokens.
[Figure: a bipartite querying graph with tokens t1–t5 (e.g., <SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) on one side and documents d1–d5 on the other.]
Estimating Execution Plan Costs: Iterative Set Expansion (contd.)
We need to compute:
- The number of documents retrieved after sending Q tokens as queries (estimates execution time).
- The number of tokens that appear in the retrieved documents (estimates recall).
To estimate these, we need:
- The degree distribution of the tokens discovered by retrieving documents.
- The degree distribution of the documents retrieved by the tokens.
(These are not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)
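Both degree distributions can be computed from a bipartite querying graph represented as a mapping from each token to the documents its query retrieves. The toy graph and all names below are illustrative assumptions.

```python
from collections import Counter

# A sketch of the two degree distributions over a toy bipartite
# querying graph.

def degree_distributions(token_to_docs):
    # Token degrees: how many tokens retrieve 1 doc, 2 docs, ...
    token_degrees = Counter(len(docs) for docs in token_to_docs.values())
    # Invert the graph to count, per document, how many tokens it contains.
    doc_token_counts = Counter()
    for docs in token_to_docs.values():
        for d in docs:
            doc_token_counts[d] += 1
    doc_degrees = Counter(doc_token_counts.values())
    return token_degrees, doc_degrees

graph = {
    "<SARS, China>": {"d1"},
    "<Ebola, Zaire>": {"d1", "d2"},
    "<Malaria, Ethiopia>": {"d2"},
}
token_deg, doc_deg = degree_distributions(graph)
```

Here two tokens retrieve one document each and one token retrieves two, while both documents contain two tokens each.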
Experimental Results
[Figure: two plots of execution time (seconds, log scale from 100 to 100,000) versus recall (0.0 to 1.0) for Scan, Filtered Scan, Iterative Set Expansion, and Automatic Query Generation; the second plot also shows the OPTIMIZED plan chosen by the optimizer.]
Experimental Results (contd.)
[Figure omitted.]
Conclusions
- Common execution plans for multiple text-centric tasks.
- Analytic models for predicting the execution time and recall of various crawl- and query-based plans.
- Techniques for on-the-fly parameter estimation.
- An optimization framework that picks, on the fly, the fastest plan for a target recall.

Thank you!