Using CrowdSourcing for Data Analytics (Drexel CCI, cci.drexel.edu/bigdata/bigdata2013/Using...)


Using CrowdSourcing for Data Analytics

Hector Garcia-Molina (work with Steven Whang, Peter Lofgren, Aditya Parameswaran, and others)

Stanford University

• Big Data Analytics

• CrowdSourcing


CrowdSourcing


Real World Examples

• Image Matching
• Translation
• Categorizing Images
• Search Relevance
• Data Gathering

Many Crowdsourcing Marketplaces!

Many Research Projects!


[Diagram: data flows into analytics, producing results, with humans assisting the pipeline]

Example tasks:
• get missing data
• verify results
• analyze data


Key Point: use humans judiciously


Today I will illustrate with:

• Entity Resolution

• (may cover another topic briefly)


Traditional Entity Resolution

[Diagram: records from System 1 ... System n pass through cleansing and then analysis; the core question: what matches what?]


Why is ER Challenging?

• Huge data sets

• No unique identifiers

• Missing data

• Lots of uncertainty

• Many ways to skin the cat


Simple ER Example

[Figure: four records a, b, c, d, with similarity scores between pairs, e.g. sim = 0.9 and sim = 0.8]

ER: Exact vs Approximate

[Diagram: a products table is partitioned into cameras, CDs, books, ...; ER runs on each partition separately, producing resolved cameras, resolved CDs, resolved books, ...]

Simple ER Algorithm

• Compute pairwise similarities

• Apply threshold

• Perform transitive closure
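The three steps fit in a few lines of code. Below is a minimal sketch (in Python, which is not part of the talk): records are plain labels, sim is any pairwise similarity function, and the transitive closure is realized with union-find. The example values at the bottom are the three-record state used later in the talk.

def simple_er(records, sim, threshold=0.7):
    # Union-find structure; merging two records' roots is what
    # eventually gives us the transitive closure.
    parent = {r: r for r in records}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path compression
            r = parent[r]
        return r

    # Step 1 + 2: compute pairwise similarities, keep pairs above threshold
    for i, r1 in enumerate(records):
        for r2 in records[i + 1:]:
            if sim(r1, r2) >= threshold:
                parent[find(r1)] = find(r2)  # step 3: merge (closure)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())

# sim(a,b) = 0.9, sim(b,c) = 0.5, sim(a,c) = 0.2, as in the talk
sims = {frozenset("ab"): 0.9, frozenset("bc"): 0.5, frozenset("ac"): 0.2}
print(simple_er(["a", "b", "c"], lambda x, y: sims[frozenset((x, y))]))
# prints [{'a', 'b'}, {'c'}] with threshold 0.7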

Walkthrough:

• Compute pairwise similarities. [Figure: similarity graph over the records; edge weights include 0.4, 0.45, 0.5, 0.6, 0.63, 0.7, 0.87, 0.9, 0.95]
• Apply threshold (threshold = 0.7). [Figure: only edges with similarity >= 0.7 survive: 0.7, 0.7, 0.87, 0.9, 0.9, 0.9, 0.95]
• Perform transitive closure. [Figure: the connected components of the thresholded graph become the final clusters]

Crowd ER

[Images: a crowd task shows a pair of items and asks: "Same as this?"]

• First Cut: for every pair of records, ask workers whether they match (i.e., get a similarity).

• Too expensive! With n records there are n(n-1)/2 pairs; even 1,000 records would mean 499,500 crowd questions. [Figure: the full similarity graph; every edge would be a paid crowd question]

• Second Cut: compute similarities automatically; workers verify only the "critical" pairs.

[Figures: the similarity graph with different candidate pairs highlighted, asking critical??: which pairs are worth sending to the crowd?]

[Architecture diagram: records pass through pairwise analysis and global analysis to produce clusters; a question generator sends questions to the crowd, and the answers return as new evidence. Key Point: use humans judiciously]

Key Issue: Semantics of Crowd Answer

[Figure: records A, B, C, D, E with a question mark: what does a worker's answer about one pair imply about the rest?]

Also an issue: Similarities as Probabilities

sim(a,b) → prob(a,b)

Strategy

[Figure: current state: records a, b, c with sim(a,b) = 0.9, sim(b,c) = 0.5, sim(a,c) = 0.2; any given ER algorithm maps this state to an ER result]

[Figure: from the current state, consider ALL possible questions (three in this example): Q(a,b), Q(b,c), Q(a,c)]

[Decision tree: each question Q(a,b), Q(b,c), Q(a,c) branches on the possible outcomes Y and N; each outcome yields a new state, and each new state an ER result]

[Figure: example: asking Q(b,c) and getting Y pins sim(b,c) to 1.0, giving the new state sim(a,b) = 0.9, sim(b,c) = 1.0, sim(a,c) = 0.2, on which the ER algorithm is run]

[Decision tree as before, now asking: what score should each of the resulting ER results get?]

Two Remaining Issues

• How do we score an ER result? Compare it to a gold standard and compute an F score.
• Efficiency?
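For reference, the standard pairwise F score between a proposed clustering and a single gold-standard clustering can be sketched as below; the talk's twist, developed next, is that the gold standard is a distribution over clusterings, so the score becomes an expectation over this function.

from itertools import combinations

def cluster_pairs(clusters):
    # All unordered pairs of records that share a cluster
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

def f_score(result, gold):
    r, g = cluster_pairs(result), cluster_pairs(gold)
    overlap = len(r & g)
    if not r or not g or not overlap:
        return 0.0
    precision = overlap / len(r)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# Example: {a,b},{c} scored against the gold standard {a,b,c}
print(f_score([{"a", "b"}, {"c"}], [{"a", "b", "c"}]))  # 0.5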

Gold Standard?

[Figure: first map similarities to probabilities: sim(a,b) = 0.9 → 1.0, sim(b,c) = 0.5 → 0.6, sim(a,c) = 0.2 → 0.2]

[Figure: the edge probabilities induce a distribution over possible worlds: four worlds with probabilities 0.12, 0.48, 0.08, and 0.32]

[Figure: running the ER algorithm in each world collapses the worlds into possible clusterings: {a, b, c} with probability 0.68 and {a, b}, {c} with probability 0.32; this distribution is the gold standard]
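These numbers can be reproduced directly. A small sketch, assuming the edge probabilities are independent (an assumption consistent with the slides' arithmetic, e.g. 1.0 x 0.6 x 0.8 = 0.48): enumerate all possible worlds, then collapse each world into a clustering via transitive closure.

from itertools import product

probs = {("a", "b"): 1.0, ("b", "c"): 0.6, ("a", "c"): 0.2}
nodes = ("a", "b", "c")

def closure(edges):
    # Transitive closure: connected components of the edge set
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n
    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return frozenset(frozenset(g) for g in groups.values())

clusterings, pairs = {}, list(probs)
for world in product([True, False], repeat=len(pairs)):
    p = 1.0
    for pair, present in zip(pairs, world):
        p *= probs[pair] if present else 1.0 - probs[pair]
    if p == 0.0:
        continue  # impossible world (here: any world missing the a-b edge)
    c = closure([pr for pr, present in zip(pairs, world) if present])
    clusterings[c] = clusterings.get(c, 0.0) + p

for c, p in clusterings.items():
    print(sorted(sorted(g) for g in c), round(p, 2))
# prints [['a', 'b', 'c']] 0.68 and [['a', 'b'], ['c']] 0.32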

[Decision tree as before: each question and each Y/N outcome yields an ER result, which is now scored against the gold standard (score vs GS?)]
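Put together, the strategy is a one-step lookahead, sketched below. Here er, score, and prob are pluggable placeholders: er can be any ER algorithm (e.g. the simple_er sketch earlier) and score can be the expected F score against the gold-standard distribution.

def best_question(sims, prob, er, score):
    # Try every question, weigh both crowd answers by their
    # probability, and keep the question with the best expected score.
    best_pair, best_expected = None, float("-inf")
    for pair in sims:                      # consider ALL possible questions
        p_yes = prob(sims[pair])           # chance the crowd answers Yes
        expected = 0.0
        for answer, p in ((1.0, p_yes), (0.0, 1.0 - p_yes)):
            new_state = dict(sims)
            new_state[pair] = answer       # Yes pins the pair to 1.0, No to 0.0
            expected += p * score(er(new_state))
        if expected > best_expected:
            best_pair, best_expected = pair, expected
    return best_pair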


Evaluating Efficiently

• See: Steven E. Whang, Peter Lofgren, and H. Garcia-Molina. Question Selection for Crowd Entity Resolution. To appear in Proc. 39th Int'l Conf. on Very Large Data Bases (PVLDB), Trento, Italy, 2013.


Sample Result

[Chart: sample experimental result]

Summary

[Diagram: data → analytics → results, with humans assisting]

Example tasks: get missing data, verify results, analyze data.
Key Point: use humans judiciously.

Now for something completely different!

[Diagram: big data flows into a DBMS, which feeds analytics]

[Diagram: the same pipeline, with humans added as a source feeding the DBMS]

DeCo: Declarative CrowdSourcing

[Diagram: an end user asks the DBMS: what is the best price for Nikon DSLR cameras? The DBMS combines stored data with input from humans]

The DBMS first consults its stored data:

model  type  brand
D7100  DSLR  Nikon
7D     DSLR  Canon
P5000  comp  Nikon
...    ...   ...

...and fetches what is missing from the crowd, e.g. asking: what is the best price for a Nikon D7100 camera?

Example with a bit more detail:

User view:

restaurant     rating  cuisine
Chez Panisse   4.9     French
Chez Panisse   4.9     California
Bytes          3.8     California
...            ...     ...

The user view is the outer join (⋈o) of an anchor table with resolved dependent tables:

Anchor:

restaurant
Chez Panisse
Bytes
...

Dependent (raw ratings):

restaurant     rating
Chez Panisse   4.8
Chez Panisse   5.0
Chez Panisse   4.9
Bytes          3.6
Bytes          4.0
...            ...

Dependent (raw cuisines):

restaurant     cuisine
Chez Panisse   French
Chez Panisse   California
Bytes          California
Bytes          California
...            ...

Fetch rules pull new values from the crowd into these tables: one rule can fetch a new restaurant for the anchor table (e.g. Bytes or Chez Panisse), while others fetch a rating or a cuisine (e.g. French) for a given restaurant.

Resolution rules reconcile the raw values for each restaurant. Here, averaging the ratings reproduces the user view: (4.8 + 5.0 + 4.9) / 3 = 4.9 for Chez Panisse and (3.6 + 4.0) / 2 = 3.8 for Bytes; duplicate cuisine rows are collapsed.

Putting it together, query processing proceeds in three steps:

1. Fetch
2. Resolve
3. Join
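For concreteness, a sketch of the three steps on this example. The raw tables stand in for what fetch rules would collect from the crowd, and the resolution rules (average the ratings, deduplicate the cuisines) are assumptions chosen because they reproduce the user view above; in Deco the schema designer specifies them.

raw_ratings = {"Chez Panisse": [4.8, 5.0, 4.9], "Bytes": [3.6, 4.0]}
raw_cuisines = {"Chez Panisse": ["French", "California"],
                "Bytes": ["California", "California"]}

def resolve_rating(vals):            # assumed resolution rule: average
    return round(sum(vals) / len(vals), 1)

def resolve_cuisine(vals):           # assumed resolution rule: deduplicate
    return sorted(set(vals))

# Join: combine the anchor with the resolved dependents into the user view
for restaurant in ("Chez Panisse", "Bytes"):       # the anchor table
    for cuisine in resolve_cuisine(raw_cuisines[restaurant]):
        print(restaurant, resolve_rating(raw_ratings[restaurant]), cuisine)
# prints the user view rows: Chez Panisse 4.9 California,
# Chez Panisse 4.9 French, Bytes 3.8 California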

Many Query Processing Challenges

A Deco query can require a minimum number of answers, e.g.:

SELECT n, l, c
FROM country
WHERE l = 'Spanish'
ATLEAST 8

[Figure: candidate query plans built from Scan, Fetch, Resolve, Filter, Join, and AtLeast operators over an anchor table A(n) and dependent tables D1(n,l), D2(n,c)]
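The ATLEAST clause is part of what makes planning hard: the plan must decide how many crowd fetches to issue. A minimal sketch of one plausible reading of an AtLeast[n] operator (an assumption based on the query above, not Deco's documented implementation):

def at_least(n, scan, fetch, pred):
    # Drain the stored rows first, then keep invoking a fetch rule
    # until at least n rows satisfy the predicate. A real system
    # would bound and batch the crowd fetches.
    rows = [row for row in scan() if pred(row)]
    while len(rows) < n:
        row = fetch()              # one crowd task
        if pred(row):
            rows.append(row)
    return rows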

Deco Prototype V1.0



Conclusion

• Crowdsourcing is important for managing data!

• Still many challenges ahead!
