Wrapper Generation Supervised by a Noisy Crowd
Valter Crescenzi, Paolo Merialdo, Disheng Qiu
Dipartimento di Ingegneria, Università degli Studi Roma Tre, Via della Vasca Navale, 79, Rome
Extracting Data
2M pages from IMDB, and we want to extract titles, directors, etc.
An inference algorithm generates the wrapper; the wrapper extracts the data into a DB.
Wrapper as XPath
To generate wrappers:
• From a single annotated page, the system generates a pool of XPath rules
• All the XPath rules are correct solutions for the annotated page
• Some of the rules do not work correctly on all the target pages

r1 = /html/table/tr[1]/td/text()
r2 = //*[contains(.,"Ratings:")]/../../tr[1]/td/text()
r3 = //*[contains(.,"Director:")]/../../tr[1]/td/text()
....

Single page (page0): r1, r2 and r3 all extract "Spirited Away".

Other pages:
     page0           page1         page2                  ..
r1   Spirited Away   City of God   Howl’s Moving Castle   ..
r2   Spirited Away   -             9.3                    ..
r3   Spirited Away   City of God   null                   ..

Which one is correct?
Extracting Data
Approaches to wrapper inference:

               Scalability   Accuracy   Coverage
Supervised     NO            OK         High
Unsupervised   OK            NO         High
Sup.+Annot.    OK            OK         Low
Scaling Wrapper Inference
Scaling out with crowdsourcing platforms opens new challenges.
Issues and contributions:
• Non-expert workers: simple interactions; Membership Queries (yes/no answers); redundant tasks and worker error rate estimation
• Costs: Active Learning*; dynamically engaging workers
• Quality: quality model; sampling algorithm*
*[Crescenzi WWW2013]
Inference Algorithm
First annotation, from which the candidate rules are generated:
r1 = /html/table/tr[1]/td/text()
r2 = //*[contains(.,"Ratings:")]/../../tr[1]/td/text()
r3 = //*[contains(.,"Director:")]/../../tr[1]/td/text()
....

Sample:
     page0           page1         page2
r1   Spirited Away   City of God   Howl’s Moving Castle
r2   Spirited Away   -             9.3
r3   Spirited Away   City of God   null

Workers’ answers: Yes/No
Quality Model: P(r1)

For each new answer:
• Rules compatible with the answer are more likely to be correct
• If no rule is good enough, a new query is selected (Active Learning)*
*[Crescenzi WWW2013]
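The idea that rules compatible with an answer become more likely can be sketched as a Bayesian update. This is only an illustration under stated assumptions, not the quality model from the talk: the uniform prior, the noise model (a worker answers wrongly with probability `eta`), the function name `update`, and the three example queries are all hypothetical. The extracted values come from the sample table.

```python
from typing import Dict, List, Optional

# Values extracted by the candidate rules on page0, page1, page2 (from the slide).
EXTRACTED: Dict[str, List[Optional[str]]] = {
    "r1": ["Spirited Away", "City of God", "Howl's Moving Castle"],
    "r2": ["Spirited Away", "-", "9.3"],
    "r3": ["Spirited Away", "City of God", None],
}


def update(probs: Dict[str, float], page: int, value: str,
           answer: bool, eta: float) -> Dict[str, float]:
    """Return P(rule) after one yes/no answer about `value` on `page`."""
    posterior = {}
    for rule, prior in probs.items():
        rule_says_yes = EXTRACTED[rule][page] == value
        compatible = rule_says_yes == answer
        # Assumed noise model: the worker is wrong with probability eta.
        posterior[rule] = prior * ((1 - eta) if compatible else eta)
    total = sum(posterior.values())
    return {rule: p / total for rule, p in posterior.items()}


# Assumed uniform prior over the candidate rules.
probs = {rule: 1 / len(EXTRACTED) for rule in EXTRACTED}
# Hypothetical Membership Queries and answers.
probs = update(probs, 1, "City of God", True, eta=0.1)
probs = update(probs, 2, "9.3", False, eta=0.1)
probs = update(probs, 2, "Howl's Moving Castle", True, eta=0.1)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

In this sketch the first two answers cannot separate r1 from r3, because both rules agree on the queried values; only the query on page2 about "Howl's Moving Castle" tells them apart, which is why the choice of the next query matters.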
Termination Strategies
Different termination strategies, between quality and costs:
• HALTr: expected quality of the wrapper (probability of correctness)
• HALTMQ: number of MQ used
• HALTH: uncertainty of the questioned value (trade-off quality/costs)
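The three termination strategies can be read as three stopping predicates. The sketch below is an assumption-laden illustration: the thresholds, the function names, and the use of Shannon entropy as the measure of "uncertainty of the questioned value" are hypothetical choices, not taken from the talk.

```python
import math
from typing import Dict


def halt_r(rule_probs: Dict[str, float], threshold: float = 0.95) -> bool:
    """Stop when the most likely rule is probable enough (quality-driven)."""
    return max(rule_probs.values()) >= threshold


def halt_mq(num_queries: int, budget: int = 10) -> bool:
    """Stop when the budget of Membership Queries is used up (cost-driven)."""
    return num_queries >= budget


def halt_h(value_probs: Dict[str, float], max_entropy: float = 0.5) -> bool:
    """Stop when the candidate values on the queried page are no longer uncertain.

    Uncertainty is measured here with Shannon entropy (an assumption).
    """
    entropy = -sum(p * math.log2(p) for p in value_probs.values() if p > 0)
    return entropy <= max_entropy


print(halt_r({"r1": 0.96, "r2": 0.04}))
print(halt_mq(3))
print(halt_h({"City of God": 0.5, "-": 0.5}))
```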
Multiple Workers
Workers can make mistakes.
We engage multiple workers on the same task, but how many?
• Too many workers: waste of money
• Not enough workers: quality loss
We apply our quality model at runtime to:
• Estimate the workers’ error rates
• Select the right number of redundant tasks
Dynamically Engaging Workers
Algorithm main steps:
• Start with a minimal amount of redundancy
• Collect the workers’ answers
• Estimate rule quality and the workers’ error rates, using:
  • the workers’ error rates to estimate rule quality
  • rule quality to estimate the workers’ error rates
• If no rule is good enough, engage a new worker
Diagram: workers’ answers, error rate estimation, most likely rule, "Is it good enough?"
Dynamically Engaging Workers
• Two real workers are engaged
• A new sequence is defined considering the union of all the answers
η = expected error rate

First worker:
Answers   “Spirited Away”   “-”   “9.3”
η         0.1               0.1   0.1
          Yes               No    No

Second worker:
Answers   “Spirited Away”   “City of God”   “9.3”
η         0.1               0.1             0.1
          Yes               No              No

Union of the answers:
Answers   “Spirited Away”   “City of God”   “9.3”   “Spirited Away”   “-”   “9.3”
η         0.1               0.1             0.1     0.1               0.1   0.1
          Yes               No              No      Yes               No    No
Dynamically Engaging Workers
• The most likely rule and its values are returned
• The most likely rule and its probability are used to estimate η
η = expected error rate

     page0           page1         page2
r1   Spirited Away   City of God   Howl’s Moving Castle
r2   Spirited Away   -             9.3
r3   Spirited Away   City of God   null

Example:
• With η = 0.1 for both workers: P(r1) = 0.9
• The η of the worker asked about “Spirited Away”, “City of God” and “9.3” is re-estimated as 0.37; the η of the other worker stays at 0.1
• With the new estimates: P(r1) = 0.93
Dynamically Engaging Workers
• When the computation converges, the system checks the termination condition
• If it is not met, a new worker is considered and the computation starts again
η = expected error rate

At convergence:
• Worker asked about “Spirited Away”, “-” and “9.3”: η = 0.05
• Worker asked about “Spirited Away”, “City of God” and “9.3”: η = 0.35
• P(r1) = 0.95
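The mutual estimation of rule quality and workers’ error rates can be sketched as an alternating, EM-like loop. This is a hypothetical illustration, not the estimator used in the talk, and it is not expected to reproduce the figures on the slides: the uniform prior, the clamping of η to [0.01, 0.5], the worker names, and the six answers are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

# (worker, page, value, answer): a yes/no answer about `value` on `page`.
Answer = Tuple[str, int, str, bool]

EXTRACTED: Dict[str, List[Optional[str]]] = {
    "r1": ["Spirited Away", "City of God", "Howl's Moving Castle"],
    "r2": ["Spirited Away", "-", "9.3"],
    "r3": ["Spirited Away", "City of God", None],
}


def compatible(rule: str, page: int, value: str, answer: bool) -> bool:
    return (EXTRACTED[rule][page] == value) == answer


def rule_probs(answers: List[Answer], eta: Dict[str, float]) -> Dict[str, float]:
    """Use the workers' error rates to estimate rule quality."""
    scores = {}
    for rule in EXTRACTED:
        score = 1.0  # assumed uniform prior
        for worker, page, value, answer in answers:
            e = eta[worker]
            score *= (1 - e) if compatible(rule, page, value, answer) else e
        scores[rule] = score
    total = sum(scores.values())
    return {rule: s / total for rule, s in scores.items()}


def error_rates(answers: List[Answer], probs: Dict[str, float],
                floor: float = 0.01) -> Dict[str, float]:
    """Use rule quality to estimate each worker's expected fraction of errors."""
    wrong: Dict[str, float] = {}
    count: Dict[str, int] = {}
    for worker, page, value, answer in answers:
        count[worker] = count.get(worker, 0) + 1
        wrong[worker] = wrong.get(worker, 0.0) + sum(
            p for rule, p in probs.items()
            if not compatible(rule, page, value, answer))
    # Clamp (assumption): a rate of exactly 0 would rule out every other rule.
    return {w: min(max(wrong[w] / count[w], floor), 0.5) for w in count}


def estimate(answers: List[Answer], initial_eta: float = 0.1,
             tol: float = 1e-6, max_iter: int = 100):
    eta = {worker: initial_eta for worker, _, _, _ in answers}
    probs = rule_probs(answers, eta)
    for _ in range(max_iter):
        new_eta = error_rates(answers, probs)
        probs = rule_probs(answers, new_eta)
        converged = all(abs(new_eta[w] - eta[w]) < tol for w in eta)
        eta = new_eta
        if converged:
            break
    return probs, eta


# Hypothetical answers: w2 makes one mistake if r1 is the correct rule.
ANSWERS: List[Answer] = [
    ("w1", 1, "City of God", True),
    ("w1", 2, "Howl's Moving Castle", True),
    ("w1", 2, "9.3", False),
    ("w2", 1, "City of God", False),
    ("w2", 2, "Howl's Moving Castle", True),
    ("w2", 1, "-", False),
]
probs, eta = estimate(ANSWERS)
print(probs, eta)
```

In this sketch the worker who contradicts the most likely rule once in three answers ends up with an estimated error rate close to one third, while the consistent worker is pushed down to the floor value.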
Experiments - Dataset
Site               Entity         |Pages|
www.imdb.com       Actor          500k
www.imdb.com       Movies         500k
www.allmusic.com   Band           500k
www.allmusic.com   Albums         500k
www.nasdaq.com     Stock Quotes   7k

40 attributes, with manually crafted golden rules

Measures:
• Costs: #MQ
• Quality: Precision, Recall and F-measure
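As a reminder of how the quality measures can be computed against a golden rule, here is a minimal sketch using the standard definitions of precision, recall and F-measure. How the experiments actually match extracted values against golden values is not stated on the slide, so the exact-match comparison and the function name are assumptions.

```python
from typing import List, Optional, Tuple


def precision_recall_f(extracted: List[Optional[str]],
                       golden: List[Optional[str]]) -> Tuple[float, float, float]:
    """Compare per-page values of a rule with those of the golden rule.

    None means that no value is extracted (or expected) on that page.
    """
    correct = sum(1 for e, g in zip(extracted, golden)
                  if e is not None and e == g)
    n_extracted = sum(1 for e in extracted if e is not None)
    n_expected = sum(1 for g in golden if g is not None)
    p = correct / n_extracted if n_extracted else 0.0
    r = correct / n_expected if n_expected else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# r3 from the running example, evaluated against r1 taken as the golden rule.
golden = ["Spirited Away", "City of God", "Howl's Moving Castle"]
r3_values = ["Spirited Away", "City of God", None]
p, r, f = precision_recall_f(r3_values, golden)
print(p, r, f)
```

Here r3 never extracts a wrong value but misses one page, so it loses recall rather than precision.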
Simulating Real Workers
100 real (and noisy) AMT workers
Real workers: 1/3 perfect, average η* = 10%, ση* = 11%
We simulated the error rate distribution with an exponential function, λe^(-λx)
[Figure: distribution of the workers’ error rates (x axis: error rate, 0.00 to 0.50; y axis: 0% to 40%) with the exponential curve]
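Synthetic workers with exponentially distributed error rates can be generated along these lines. This is a sketch under assumptions: the mean of 0.10 matches the average η* reported for the real workers, while the cap at 0.5, the seed, and the function names are hypothetical choices.

```python
import random
from statistics import mean
from typing import List


def sample_error_rates(n: int, mean_eta: float = 0.10,
                       cap: float = 0.5, seed: int = 0) -> List[float]:
    """Draw n error rates from an exponential distribution with the given mean."""
    rng = random.Random(seed)
    # expovariate takes the rate lambda = 1 / mean; rates above `cap` are clipped.
    return [min(rng.expovariate(1 / mean_eta), cap) for _ in range(n)]


def noisy_answer(truth: bool, eta: float, rng: random.Random) -> bool:
    """Return the true yes/no answer, flipped with probability eta."""
    return (not truth) if rng.random() < eta else truth


rates = sample_error_rates(10_000)
print(round(mean(rates), 3))
```

An exponential distribution has a standard deviation equal to its mean, which is close to the 10% average and 11% standard deviation observed on the real workers.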
Wrong Estimation
Noisy single worker:
- η: expected error rate
- η*: observed error rate

     η* > η (optimistic)   η* = η (correct)   η* < η (pessimistic)
MQ   ~10                   ~10                ~30
F    ~0.65                 ~1                 ~1

• η close to η* (good estimation): few MQ, good F
• η* > η (too optimistic): too few MQ, low F
• η > η* (too pessimistic): too many MQ, same F
Need to estimate the workers’ error rate
Dynamically Engaging Workers
Synthetic (and noisy) workers; |W| = # workers

Algorithm           F      σF    #MQ     max-MQ   max-|W|   |η-η*|
ALFη (one worker)   0.92   17%   7.58    11       1         -
ALFREDno            1      1%    18.6    83       9         -
ALFRED              1      1%    16.1    44       4         0.8%
ALFRED*             1      1%    16.07   40       4         0%

Observations:
• lower quality, fewer MQ
• almost perfect wrapper
• correct estimation required
• accurate estimation, but achieved only at the end

[Figure: distribution (%) of the number of engaged workers |W| (2, 3, 4); bar values 2%, 6% and 92%]
Conclusions
We proposed a framework for wrapper generation:
• simple tasks that can be completed by non-expert workers
• cost-effective wrapper generation
• highly predictable quality of the output wrapper
Solid background in machine learning and computational learning theory*
The proposed framework can be applied to other learning tasks:
• Crawling
• NLP
*[Angluin-Laird1988, Angluin2001]
Future development
• Learning framework applied to other problems (NLP, Entity Linkage)
• ALFRED adopted to learn structure-driven crawling algorithms
• Hybrid approaches combining human annotations and automatic annotations
• Alternative models of truth/error rate
• Optimizing the initial number of workers
Wrong Estimation
Noisy single worker:
- η = 0.1
- η* = from 0.05 to 0.4
[Figure: F (0.4 to 1.1) for HALTr, HALTH and HALTMQ; x axis from 0.05 to 0.4]
[Figure: MQ (4 to 14) for HALTr and HALTH; x axis from 0.05 to 0.4]
Wrong Estimation
Noisy single worker:
- η = from 0 to 0.4
- η* = 0.1
[Figure: F (0.5 to 1) for HALTr, HALTH and HALTMQ; x axis from 0 to 0.4]
[Figure: MQ (axis ticks 3, 10, 100) for HALTr and HALTH; x axis from 0 to 0.4]
Sampling & Quality
Pages make apparent the differences among the rules:

     page0           page1         page2
r1   Spirited Away   City of God   Howl’s Moving Castle
r2   Spirited Away   -             9.3
r3   Spirited Away   City of God   null

• page0 only: r1 = r2 = r3
• page0 and page1: r1 = r3 ≠ r2
• page0, page1 and page2: r1 ≠ r3 ≠ r2

Find a small set that makes apparent the same differences observed in the whole set of pages
Sampling & Quality
The problem.
Find the smallest set of pages that makes apparent the differences among the rules (e.g., 100 pages that make apparent the same differences that we would observe in 2M pages).
It is an NP-hard problem. Reduction to the SET-COVER problem: find the smallest set of pages that covers all the groups of rules (group = equivalent rules).
The smallest set is not needed: a greedy algorithm, O(|Pages|) in time and O(1) in space, works very well in practice.

Sampling & Quality
Input: XPath rules
For every page p:
    if (p makes apparent new differences)
        representative pages += p
An offline algorithm that can be easily parallelized
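The greedy selection of representative pages can be sketched as a single pass over the pages. This is an illustration under assumptions, not the implementation from the talk: a "difference" is modeled here as a pair of rules that extract different values on a page, the function name is hypothetical, and page3 is an invented page added to show a redundant case.

```python
from itertools import combinations
from typing import Dict, List, Optional, Set, Tuple


def representative_pages(extracted: Dict[str, List[Optional[str]]]) -> List[int]:
    """Keep a page only if it separates a pair of rules not yet separated."""
    rules = sorted(extracted)
    n_pages = len(next(iter(extracted.values())))
    seen: Set[Tuple[str, str]] = set()
    selected: List[int] = []
    for page in range(n_pages):  # one pass over the pages
        differences = {(a, b) for a, b in combinations(rules, 2)
                       if extracted[a][page] != extracted[b][page]}
        if not differences <= seen:  # the page shows at least one new difference
            selected.append(page)
            seen |= differences
    return selected


# page0 to page2 come from the running example; page3 is hypothetical.
EXTRACTED = {
    "r1": ["Spirited Away", "City of God", "Howl's Moving Castle", "Up"],
    "r2": ["Spirited Away", "-", "9.3", "8.3"],
    "r3": ["Spirited Away", "City of God", None, "Up"],
}
print(representative_pages(EXTRACTED))
```

The state kept between pages is the set of already separated rule pairs, so its size depends on the number of rules and not on the number of pages.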
Sampling
Entity   Sampling         |Pages|   P      R
Movies   Biased           250       0.98   0.71
Movies   Random           250       0.99   0.99
Movies   Representative   42        1.00   1.00
Actors   Biased           250       1.00   1.00
Actors   Random           250       1.00   0.96
Actors   Representative   30        1.00   1.00
Stocks   Biased           86        1.00   0.98
Stocks   Random           86        1.00   0.99
Stocks   Representative   15        1.00   1.00
Albums   Biased           258       1.00   0.99
Albums   Random           258       1.00   1.00
Albums   Representative   59        1.00   1.00
Bands    Biased           289       1.00   0.68
Bands    Random           289       1.00   1.00
Bands    Representative   36        1.00   1.00

• Representative: perfect
• Biased: recall loss
• Random: better than biased, but not perfect
Related Wrapper Generation
• Automatic Wrappers for Large Scale Web Extraction, Nilesh Dalvi et al., VLDB 2011
• DIADEM, T. Furche, G. Gottlob, et al., WWW 2012
• Web Data Extraction Based on Partial Tree Alignment, Yanhong Zhai, WWW 2005
• Extracting Structured Data from Web Pages, Arvind Arasu, Hector Garcia-Molina, SIGMOD 2003
• RoadRunner, Crescenzi, VLDB 2001
• Wrapper Induction for Information Extraction, Kushmerick, IJCAI 97
• Active Learning with Multiple Views, Ion Muslea, JAIR 2006
• Interactive Wrapper Generation with Minimal User Effort, Utku Irmak, WWW 2006