Wrapper Generation Supervised by a Noisy Crowd


Valter Crescenzi, Paolo Merialdo, Disheng Qiu

Dipartimento di Ingegneria, Università degli Studi Roma Tre, Via della Vasca Navale, 79, Rome

[email protected]


Extracting Data

We have 2M pages from IMDB, and we want to extract ... titles, directors, etc. ....

[Figure: pages, inference algorithm, wrapper, DB]

Wrapper as XPath

r1 = /html/table/tr[1]/td/text()
r2 = //*[contains(.,"Ratings:")]/../../tr[1]/td/text()
r3 = //*[contains(.,"Director:")]/../../tr[1]/td/text()
....

To generate wrappers:

• From a single annotated page, the system generates a pool of XPath rules
• All the XPath rules are correct solutions for the annotated page
• Some of the rules do not work correctly on all the target pages
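The behavior described in these bullets can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy pages, the rating values, and the two paths, which are simplified stand-ins for r1 and r2 because Python's `xml.etree.ElementTree` supports only a subset of XPath (no `text()` or `contains()`).

```python
import xml.etree.ElementTree as ET

# Two toy pages (hypothetical markup, far simpler than real IMDB pages).
PAGE0 = (
    "<html><body><table>"
    "<tr><td>Spirited Away</td></tr>"
    "<tr><td>Ratings:</td><td>8.6</td></tr>"
    "</table></body></html>"
)
PAGE1 = (
    "<html><body>"
    "<table><tr><td>City of God</td></tr></table>"
    "<table><tr><td>8.7</td></tr><tr><td>Ratings:</td></tr></table>"
    "</body></html>"
)

# Simplified stand-ins for r1 and r2 (assumption: not the rules on the slide).
RULES = {
    # Absolute position: first row of a table under <body>.
    "r1": "./body/table/tr[1]/td",
    # Anchored on a label: first row of the table that contains "Ratings:".
    "r2": ".//tr[td='Ratings:']/../tr[1]/td",
}

def extract(page, path):
    """Return the text of the first node matched by path, or None."""
    node = ET.fromstring(page).find(path)
    return None if node is None else node.text

# Both rules agree on the annotated page and disagree on the other page.
for name, path in RULES.items():
    print(name, extract(PAGE0, path), extract(PAGE1, path))
```

Both rules return the title on the annotated page, but only one of them keeps returning a title on the second page, which is why a single annotation cannot decide between them.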

Single page:

       page0
r1     Spirited Away
r2     Spirited Away
r3     Spirited Away

Other pages:

       page0            page1          page2                   ..
r1     Spirited Away    City of God    Howl’s Moving Castle    ..
r2     Spirited Away    -              9.3                     ..
r3     Spirited Away    City of God    null                    ..

Which one is correct?

Extracting Data

                Scalability    Accuracy    Coverage
Supervised      NO             OK          High
Unsupervised    OK             NO          High
Sup.+Annot.     OK             OK          Low

Crowdsourcing

An opportunity to scale supervised approaches



Scaling Wrapper Inference

Scaling out with crowdsourcing platforms opens new challenges:

Issue: Non-expert workers
Contributions:
• Simple interactions
• Membership Query (yes/no answer)
• Redundant tasks and worker error rate estimation

Issue: Costs
Contributions:
• Active Learning*
• Dynamically engaging workers

Issue: Quality
Contributions:
• Quality Model
• Sampling algorithm*

*[Crescenzi WWW2013]

Inference Algorithm

       page0            page1          page2
r1     Spirited Away    City of God    Howl’s Moving Castle
r2     Spirited Away    -              9.3
r3     Spirited Away    City of God    null

[Figure: first annotation, sample, Yes/No membership queries, workers’ answers]

Quality Model: P(r1)

For each new answer:

• Rules compatible with the answer are more likely to be correct
• If no rule is good enough, a new query is selected (Active Learning)*
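One way to read "rules compatible with the answer are more likely to be correct" is as a Bayesian update over the pool of rules. The sketch below is an assumption made for illustration, not the quality model of the talk: it starts from a uniform prior, assumes a worker who answers each yes/no query wrongly with a fixed probability `eta`, and the names `update` and `VALUES` are hypothetical.

```python
# Values extracted by each rule on page0, page1, page2 (from the slide).
VALUES = {
    "r1": ["Spirited Away", "City of God", "Howl’s Moving Castle"],
    "r2": ["Spirited Away", "-", "9.3"],
    "r3": ["Spirited Away", "City of God", None],
}

def update(probs, page, value, answer, eta):
    """Posterior over rules after a worker answers the query
    'is `value` the correct value on `page`?' with `answer` (True = yes).
    Assumption: the worker is wrong with probability eta."""
    posterior = {}
    for rule, prior in probs.items():
        # If this rule were the correct one, the true answer would be
        # "yes" exactly when the rule extracts the queried value.
        truth = VALUES[rule][page] == value
        likelihood = (1 - eta) if truth == answer else eta
        posterior[rule] = prior * likelihood
    total = sum(posterior.values())
    return {rule: p / total for rule, p in posterior.items()}

# Uniform prior: after the first annotation all rules look equally good.
probs = {rule: 1 / len(VALUES) for rule in VALUES}
# Query 1: "-" on page1? The worker says no.
probs = update(probs, page=1, value="-", answer=False, eta=0.1)
# Query 2: "Howl’s Moving Castle" on page2? The worker says yes.
probs = update(probs, page=2, value="Howl’s Moving Castle", answer=True, eta=0.1)
print(probs)
```

After two answers most of the probability mass sits on r1, while r2 and r3 keep a small share because the worker may have answered wrongly.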



Termination Strategies

Different termination strategies:

• HALTr (quality): expected quality of the wrapper (probability of correctness)
• HALTMQ (costs): number of MQ used
• HALTH (quality/costs trade-off): uncertainty of the questioned value
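The three strategies above can be sketched as simple predicates. This is only a sketch under stated assumptions: the thresholds are hypothetical, and reading "uncertainty of the questioned value" as the binary entropy of the probability that the queried value is correct is an interpretation, not something the slide states.

```python
import math

def halt_r(probs, threshold=0.95):
    """HALTr: stop when the most likely rule is probable enough.
    The threshold is a hypothetical value."""
    return max(probs.values()) >= threshold

def halt_mq(n_queries, budget=10):
    """HALTMQ: stop when the budget of membership queries is used up.
    The budget is a hypothetical value."""
    return n_queries >= budget

def binary_entropy(p):
    """Entropy (in bits) of a yes/no outcome with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def halt_h(p_value_correct, max_entropy=0.3):
    """HALTH: stop when even the selected query is barely uncertain.
    Assumption: uncertainty is measured as binary entropy."""
    return binary_entropy(p_value_correct) <= max_entropy

print(halt_r({"r1": 0.96, "r2": 0.03, "r3": 0.01}))
print(halt_mq(7))
print(halt_h(0.5), halt_h(0.99))
```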

Multiple Workers

Workers can make mistakes

We engage multiple workers on the same task, but how many?

• Too many workers: waste of money
• Not enough workers: quality loss

We apply our quality model at runtime to:

• Estimate the workers’ error rates
• Select the right number of redundant tasks

Dynamically Engaging Workers

Algorithm main steps:

• Start with a minimal amount of redundancy
• Collect the workers’ answers
• Estimate rule quality and the workers’ error rates, using:
    • the workers’ error rates to estimate rule quality
    • rule quality to estimate the workers’ error rates
• If no rule is good enough, engage a new worker

[Figure labels: workers’ answers, most likely rule, "Is it good enough?", error rate estimation]
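The mutual estimation in the steps above (error rates used to score rules, rule scores used to re-estimate error rates) can be sketched as a fixed-point iteration. This is an illustration under assumptions, not the authors' algorithm: the uniform prior, the clamping of η to [0.01, 0.49], the toy answers, and all function names are hypothetical.

```python
VALUES = {
    "r1": ["Spirited Away", "City of God", "Howl’s Moving Castle"],
    "r2": ["Spirited Away", "-", "9.3"],
    "r3": ["Spirited Away", "City of God", None],
}

# Toy answers: (worker, page, queried value, answer). Worker w2 is noisy.
ANSWERS = [
    ("w1", 0, "Spirited Away", True),
    ("w1", 1, "-", False),
    ("w1", 2, "Howl’s Moving Castle", True),
    ("w2", 0, "Spirited Away", True),
    ("w2", 1, "City of God", True),
    ("w2", 2, "9.3", True),
]

def agrees(rule, page, value, answer):
    """True if the answer is the right one, assuming `rule` is correct."""
    return (VALUES[rule][page] == value) == answer

def rule_posterior(answers, eta):
    """Rule quality from the current error rates (uniform prior)."""
    post = {rule: 1.0 for rule in VALUES}
    for worker, page, value, answer in answers:
        for rule in VALUES:
            ok = agrees(rule, page, value, answer)
            post[rule] *= (1 - eta[worker]) if ok else eta[worker]
    total = sum(post.values())
    return {rule: p / total for rule, p in post.items()}

def estimate(answers, eta0=0.1, max_iters=50, tol=1e-6):
    """Alternate the two estimates until the error rates stop changing."""
    workers = sorted({a[0] for a in answers})
    eta = {w: eta0 for w in workers}
    for _ in range(max_iters):
        post = rule_posterior(answers, eta)
        new_eta = {}
        for w in workers:
            mine = [a for a in answers if a[0] == w]
            # Expected fraction of wrong answers, weighted by rule quality.
            wrong = sum(post[rule] for _, page, value, answer in mine
                        for rule in VALUES
                        if not agrees(rule, page, value, answer))
            new_eta[w] = min(max(wrong / len(mine), 0.01), 0.49)
        converged = max(abs(new_eta[w] - eta[w]) for w in workers) < tol
        eta = new_eta
        if converged:
            break
    return rule_posterior(answers, eta), eta

post, eta = estimate(ANSWERS)
print(post, eta)
```

On these toy answers the iteration settles on r1 as the most likely rule and assigns the second worker a much higher error rate than the first; a termination check would then decide whether another worker is needed.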

η = expected error rate

• Two real workers are engaged

Answers: “Spirited Away” “-” “9.3”
η: 0.1 0.1 0.1
No Yes No

Answers: “Spirited Away” “City of God” “9.3”
η: 0.1 0.1 0.1
No Yes No

• A new sequence is defined considering the union of all the answers

Answers: “Spirited Away” “City of God” “9.3” “Spirited Away” “-” “9.3”
η: 0.1 0.1 0.1 0.1 0.1 0.1
No Yes No Yes No No

• The most likely rule and its values are returned

• The most likely rule and its probability are used to estimate η

e.g.:

P(r1)    η (one worker)    η (other worker)
0.9      0.1               0.1
0.93     0.1               0.37
0.95     0.05              0.35

• When the computation converges, the system checks the termination condition

• If it is not met, a new worker is considered and the computation starts again


Experiments - Dataset

Site                Entity          |Pages|
www.imdb.com        Actor           500k
www.imdb.com        Movies          500k
www.allmusic.com    Band            500k
www.allmusic.com    Albums          500k
www.nasdaq.com      Stock Quotes    7k

40 attributes

manually crafted golden rules

Measures:

• Costs: #MQ
• Quality: Precision, Recall and F-measure
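As a reminder of how the quality measures can be computed against a golden rule, here is a minimal sketch. The slide does not spell out the exact definitions used in the experiments, so the conventions below are assumptions: `None` stands for a missing value, precision is computed over the pages where the rule extracts something, and recall over the pages where the golden rule does.

```python
def precision_recall_f(extracted, golden):
    """Compare, page by page, the values of a rule with the golden values."""
    correct = sum(1 for e, g in zip(extracted, golden)
                  if e is not None and e == g)
    n_extracted = sum(1 for e in extracted if e is not None)
    n_golden = sum(1 for g in golden if g is not None)
    precision = correct / n_extracted if n_extracted else 0.0
    recall = correct / n_golden if n_golden else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Values from the running example; r1 plays the role of the golden rule.
golden = ["Spirited Away", "City of God", "Howl’s Moving Castle"]
r3 = ["Spirited Away", "City of God", None]
p, r, f = precision_recall_f(r3, golden)
print(p, r, f)
```

In this example r3 never extracts a wrong value but misses one page, so it keeps full precision while losing recall.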

Simulating Real Workers

100 real (and noisy) AMT workers

Real workers: 1/3 perfect, average η* = 10%, ση* = 11%

We simulated the error rate distribution with an exponential function

[Figure: distribution of the workers’ error rates (x axis: error rate, 0.00 to 0.50; y axis: 0% to 40%) with the fitted exponential curve]
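A simulation of this kind can be sketched as follows. The slide only says that an exponential function was used, so the parameters here are assumptions: the rate is chosen so that the mean error rate is 10%, and the cap at 0.49 and the helper names are hypothetical.

```python
import random

MEAN_ERROR_RATE = 0.10  # average observed error rate reported on the slide

def sample_error_rate(rng):
    """Draw a worker's error rate from an exponential distribution.
    Assumptions: mean 0.10, capped below 0.5 so that a worker is never
    worse than a coin flip."""
    return min(rng.expovariate(1.0 / MEAN_ERROR_RATE), 0.49)

def noisy_answer(truth, eta, rng):
    """Return the true yes/no answer, flipped with probability eta."""
    return (not truth) if rng.random() < eta else truth

rng = random.Random(42)
rates = [sample_error_rate(rng) for _ in range(10_000)]
mean = sum(rates) / len(rates)
print(round(mean, 3))
```

An exponential distribution has a standard deviation equal to its mean, which is roughly in line with the 10% average and 11% standard deviation observed on the real workers.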

Wrong Estimation

Noisy single worker:
- η: expected error rate
- η*: observed error rate

      η* > η (optimistic)    η* = η (correct)    η* < η (pessimistic)
MQ    ~10                    ~10                 ~30
F     ~0.65                  ~1                  ~1

• η close to η* (good estimation): few MQ, good F
• η* > η (too optimistic): too few MQ, low F
• η > η* (too pessimistic): too many MQ, same F

Need to estimate the workers’ error rate


Synthetic (and noisy) workers; |W| = # workers

Algorithm            F       σF     #MQ      max-MQ    max-|W|    |η-η*|
ALFη (one worker)    0.92    17%    7.58     11        1          -
ALFREDno             1       1%     18.6     83        9          -
ALFRED               1       1%     16.1     44        4          0.8%
ALFRED*              1       1%     16.07    40        4          0%

Callouts:
• lower quality, fewer MQ
• almost perfect wrapper
• correct estimation required
• accurate estimation, but achieved only at the end

[Figure: bar chart of % by |W| (2, 3, 4), x axis from 0% to 100%; values shown: 2%, 6%, 92%]

Conclusions

We proposed a framework for wrapper generation:

• simple tasks that can be completed by non-expert workers

• cost-effective wrapper generation

• highly predictable quality of the output wrapper

The proposed framework can be applied to other learning tasks:

• Crawling
• NLP

Solid background in machine learning and computational learning theory*

*[Angluin-Laird1988, Angluin2001]

Thank you for your attention!

Future development

Applying the learning framework to other problems (NLP, Entity Linkage)

Adopting ALFRED to learn structure-driven crawling algorithms

Hybrid approaches combining human annotations and automatic annotations

Alternative models of truth/error rate

Optimizing the initial number of workers

Wrong Estimation

Noisy single worker:
- η = 0.1
- η* = from 0.05 to 0.4

[Figure: F (y axis, 0.4 to 1.1) over the range 0.05 to 0.4 (x axis) for HALTr, HALTH and HALTMQ]

[Figure: MQ (y axis, 4 to 14) over the range 0.05 to 0.4 (x axis) for HALTr and HALTH]

Wrong Estimation

Noisy single worker:
- η = from 0 to 0.4
- η* = 0.1

[Figure: F (y axis, 0.5 to 1) over the range 0 to 0.4 (x axis) for HALTr, HALTH and HALTMQ]

[Figure: MQ (y axis, ticks at 3, 10 and 100) over the range 0 to 0.4 (x axis) for HALTr and HALTH]

Sampling & Quality

With page0 only: r1 = r2 = r3

       page0
r1     Spirited Away
r2     Spirited Away
r3     Spirited Away

With page0 and page1: r1 = r3 ≠ r2

       page0            page1
r1     Spirited Away    City of God
r2     Spirited Away    -
r3     Spirited Away    City of God

With page0, page1 and page2: r1 ≠ r3 ≠ r2

       page0            page1          page2
r1     Spirited Away    City of God    Howl’s Moving Castle
r2     Spirited Away    -              9.3
r3     Spirited Away    City of God    null

Pages make apparent the differences among the rules

Find a small set that makes apparent the same differences observed in the whole set of pages

Sampling & Quality

The problem.

Find the smallest set that makes apparent the differences among the rules
(e.g., 100 pages that make apparent the same differences that we would observe in 2M pages).

It is an NP-hard problem! Reduction to the SET-COVER problem:
find the smallest set of pages that covers all the groups of rules (group = equivalent rules).

The smallest set is not needed:
a greedy algorithm, O(|Pages|) in time and O(1) in space, works very well in practice.

XPath rules

For every page p:
    if (p makes apparent new differences)
        representative pages += p

An offline algorithm that can be easily parallelized
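The greedy pass above can be sketched in a few lines. This is one possible reading, stated as an assumption: a "difference" is taken to be a pair of rules that extract different values on a page, and the function and variable names are hypothetical. The pass keeps only the set of pairs seen so far, whose size depends on the number of rules and not on the number of pages.

```python
from itertools import combinations

def representative_pages(pages, rules):
    """Single pass over the pages: keep a page only if it separates at
    least one pair of rules that no earlier kept page separates.
    `pages` is an iterable of (page_id, {rule: extracted value})."""
    seen = set()    # pairs of rules already told apart
    sample = []
    for page_id, values in pages:
        diffs = {(a, b) for a, b in combinations(rules, 2)
                 if values[a] != values[b]}
        if not diffs <= seen:  # the page makes apparent new differences
            seen |= diffs
            sample.append(page_id)
    return sample

RULES = ["r1", "r2", "r3"]
PAGES = [
    ("page0", {"r1": "Spirited Away", "r2": "Spirited Away", "r3": "Spirited Away"}),
    ("page1", {"r1": "City of God", "r2": "-", "r3": "City of God"}),
    ("page2", {"r1": "Howl’s Moving Castle", "r2": "9.3", "r3": None}),
    # Hypothetical extra page that behaves like page1 and adds nothing new.
    ("page3", {"r1": "Ponyo", "r2": "-", "r3": "Ponyo"}),
]
sample = representative_pages(PAGES, RULES)
print(sample)
```

Because each page is examined independently against the set of differences seen so far, the page set can be split into chunks that are processed separately and then merged with the same pass, which is one way such an offline algorithm could be parallelized.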

Sampling

Entity    Sampling          |Pages|    P       R
Movies    Biased            250        0.98    0.71
Movies    Random            250        0.99    0.99
Movies    Representative    42         1.00    1.00
Actors    Biased            250        1.00    1.00
Actors    Random            250        1.00    0.96
Stocks    Biased            86         1.00    0.98
Stocks    Random            86         1.00    0.99
Albums    Biased            258        1.00    0.99
Albums    Random            258        1.00    1.00
Bands     Biased            289        1.00    0.68
Bands     Random            289        1.00    1.00

• Representative: perfect
• Biased: recall loss
• Random: better than biased, but not perfect

Related Wrapper Generation

• Automatic Wrappers for Large Scale Web Extraction. Nilesh Dalvi et al., VLDB 2011
• DIADEM. T. Furche, G. Gottlob, et al., WWW 2012
• Web Data Extraction Based on Partial Tree Alignment. Yanhong Zhai, WWW 2005
• Extracting Structured Data from Web Pages. Arvind Arasu, Hector Garcia-Molina, SIGMOD 2003
• RoadRunner. Crescenzi, VLDB 2001
• Wrapper Induction for Information Extraction. Kushmerick, IJCAI 97
• Active Learning with Multiple Views. Ion Muslea, JAIR 2006
• Interactive Wrapper Generation with Minimal User Effort. Utku Irmak, WWW 2006
