SampleClean: Bringing Data Cleaning into the BDAS Stack

DESCRIPTION

"SampleClean: Bringing Data Cleaning into the BDAS Stack" presentation at AMPCamp 5 by Sanjay Krishnan and Daniel Haas

SampleClean: Bringing Data Cleaning into the BDAS Stack

Sanjay Krishnan and Daniel Haas
In Collaboration With: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim Kraska, Michael Franklin, Tova Milo, Ken Goldberg

Who publishes more?

Microsoft Academic Search

Paper Id   Affiliation
16         Computer Science Division--University of California Berkeley CA
101        University of California at Berkeley
102        Department of Physics Stanford University California
116        Lawrence Berkeley National Labs <ref>California</ref>
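The question on the slide can be made concrete with a small sketch. Counting papers by the raw affiliation string treats every spelling as a separate institution, while counting after cleaning groups them sensibly. The `canonical_institution` function and its keyword rules are assumptions made for illustration only; they are not part of SampleClean.

```python
import re
from collections import Counter

# Affiliation strings copied from the table above (paper id -> affiliation).
affiliations = {
    16: "Computer Science Division--University of California Berkeley CA",
    101: "University of California at Berkeley",
    102: "Department of Physics Stanford University California",
    116: "Lawrence Berkeley National Labs <ref>California</ref>",
}

def canonical_institution(raw: str) -> str:
    """Hypothetical rule-based cleaner: strip markup, then match keywords."""
    text = re.sub(r"<[^>]+>", " ", raw).lower()
    if "stanford" in text:
        return "Stanford University"
    if "lawrence berkeley" in text:
        return "Lawrence Berkeley National Laboratory"
    if "berkeley" in text:
        return "UC Berkeley"
    return raw  # leave unknown strings untouched

# Grouping on the dirty strings: every record is its own group.
dirty_counts = Counter(affiliations.values())

# Grouping on the cleaned strings: the two UC Berkeley papers are merged.
clean_counts = Counter(canonical_institution(a) for a in affiliations.values())
```

On the dirty strings no institution has more than one paper, so the question cannot be answered; after cleaning, the two UC Berkeley records fall into the same group.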

Microsoft Academic Search

University of California at Berkeley
Computer Science Division
University of California at Berkeley
Computer Science Division
University of California at Berkeley

Department of Physics Stanford University California

Enter SampleClean

•  Data cleaning in BDAS.
    –  Problem 1. Scale
    –  Problem 2. Latency
•  Sampling to cope with scale.
•  Asynchrony to cope with latency.

Now it’s your turn!

Be the crowd and help us decide

Dirty Data is Ubiquitous

Example: Missing, incomplete, inconsistent data

Data Cleaning is Hard

•  Time consuming
•  Costly
•  Domain-specific

A New Data Cleaning Architecture

[Diagram: Data, Cleaning, Data, Analytics]

Can it Scale?

[Figure: Crowd, Machine Learning, and Regex cleaning arranged along a Time axis]

People are slow and expensive

Insight 1: Asynchrony Hides Latency

Insight 2: Sampling Hides Scale

[Figure: axes labeled Query Error, Time, and Data Error; BlinkDB and SampleClean are marked on the plot]

SampleClean Data Flow

[Diagram: Dirty Data, Dirty Sample, Clean Sample, Query, Data Cleaning; stages labeled Sampling and Asynchrony]
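A minimal sketch of this flow, assuming a toy data set of (affiliation, citation count) rows. The helper names (`draw_sample`, `clean_row`, `run_query`), the cleaning rule, and the data are all hypothetical. The point is only that the expensive cleaning step runs on the sampled rows rather than on every row.

```python
import random

def draw_sample(dirty_rows, k, seed=0):
    """Uniform random sample without replacement (Dirty Data -> Dirty Sample)."""
    return random.Random(seed).sample(dirty_rows, k)

def clean_row(row):
    """Hypothetical cleaner (Dirty Sample -> Clean Sample).

    In practice this is the expensive step, for example a crowd task.
    Here it trims whitespace, lowercases, and treats negative counts as 0.
    """
    affiliation, citations = row
    return affiliation.strip().lower(), max(citations, 0)

def run_query(clean_sample, total_rows):
    """Estimate SUM(citations) over all rows by scaling the sample mean."""
    mean = sum(citations for _, citations in clean_sample) / len(clean_sample)
    return mean * total_rows

# Toy dirty data: stray whitespace, mixed case, and -1 used as a bad count.
dirty_data = [(" UC Berkeley ", 5)] * 900 + [("STANFORD", -1)] * 100

dirty_sample = draw_sample(dirty_data, k=50)
clean_sample = [clean_row(r) for r in dirty_sample]  # 50 cleaning calls, not 1000
estimate = run_query(clean_sample, total_rows=len(dirty_data))
```

The query result is an estimate, so it comes with sampling error; the cleaning cost, however, is bounded by the sample size `k`.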

The SampleClean Architecture

[Diagram: a Dirty Sample becomes a Clean Sample through three components: Data Cleaning Library, Approximate Query Processing, and Asynchronous Pipelines. Users "Declare Cleaning Operations" and "Issue Queries, Get Results".]

Approximate Query Processing

•  Estimate early results and bound with error bars

SampleClean: Fast and Accurate Query Processing on Dirty Data. SIGMOD 2014
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. EuroSys 2013

[Figure: Query Error versus Time]
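As an illustration of "early results with error bars", the sketch below estimates a mean from samples of increasing size and attaches an approximate 95% confidence interval based on the normal approximation. It is a generic textbook estimator on synthetic data, not the estimator from the SampleClean or BlinkDB papers; `estimate_mean` and the data are assumptions.

```python
import math
import random
import statistics

def estimate_mean(sample, z=1.96):
    """Return (estimate, half_width) of an approximate 95% confidence interval."""
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, z * stderr

rng = random.Random(42)
# Synthetic "full data set": 100,000 values centred on 50.
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

results = []
for n in (100, 1_000, 10_000):
    estimate, half_width = estimate_mean(rng.sample(population, n))
    results.append((n, estimate, half_width))
    print(f"n={n:>6}: {estimate:.2f} +/- {half_width:.2f}")
```

The half-width shrinks roughly with the square root of the sample size, which is why a small sample already gives a usable answer and more time buys a tighter bound.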

Crowds and Machines Work Together

•  Extensible library of data cleaning tools
•  Tools are:
    –  Automated
    –  Human-powered
    –  Hybrid

[Figure: Crowd, Machine Learning, and Regex arranged along a Time axis]

Active Learning and Crowds

•  Choose informative training points

Not Informative

Are these the same?
Stanford Department of IEOR
UC Berkeley Stats
○ Yes   ◉ No

Informative

Are these the same?
Department of Mathematics Stanford University
University of California Berkeley Department of Mathematics
○ Yes   ◉ No
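One common way to "choose informative training points" is uncertainty sampling: send the crowd the pair the current model is least sure about. The sketch below uses token-set Jaccard similarity as a stand-in for a model's match probability, which is an assumption made for illustration; the slides do not say which model or selection rule SampleClean uses.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def uncertainty(score: float) -> float:
    """1.0 when the score sits on the 0.5 decision boundary, 0.0 at 0 or 1."""
    return 1.0 - 2.0 * abs(score - 0.5)

# The two candidate questions shown on the slide.
candidate_pairs = [
    ("Stanford Department of IEOR", "UC Berkeley Stats"),
    ("Department of Mathematics Stanford University",
     "University of California Berkeley Department of Mathematics"),
]

# Rank pairs by how close their similarity is to the decision boundary.
scored = [(uncertainty(jaccard(a, b)), (a, b)) for a, b in candidate_pairs]
most_informative = max(scored)[1]
```

The IEOR and Stats strings share no tokens, so the stand-in model is already confident they differ and a crowd label adds little. The two Mathematics strings share four of seven distinct tokens (similarity 4/7), which is close to the boundary, so that pair is selected.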

Putting it all together: Asynchronous Pipelines

•  Users group data cleaning operations into pipelines
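The idea of grouping cleaning operations into a pipeline that runs in the background can be sketched as follows. The `AsyncPipeline` class, its methods, and the string operations are hypothetical and written in plain Python threads; this is not the SampleClean API.

```python
import threading

class AsyncPipeline:
    """Hypothetical sketch: applies cleaning operations to a sample in the background."""

    def __init__(self, operations):
        self.operations = list(operations)  # applied to each row in order
        self._lock = threading.Lock()
        self._thread = None
        self.sample = []
        self.cleaned = 0

    def start(self, sample):
        self.sample = list(sample)  # current best version of the sample
        self.cleaned = 0            # number of rows cleaned so far
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        for i, row in enumerate(list(self.sample)):
            for op in self.operations:
                row = op(row)
            with self._lock:
                self.sample[i] = row
                self.cleaned += 1

    def query(self, fn):
        """Run a query on whatever has been cleaned so far; never waits for cleaning."""
        with self._lock:
            return fn(list(self.sample)), self.cleaned

    def wait(self):
        self._thread.join()

pipeline = AsyncPipeline([str.strip, str.lower])
pipeline.start([" UC Berkeley ", "uc berkeley", "Stanford "])

count_distinct = lambda rows: len(set(rows))
early = pipeline.query(count_distinct)   # may see a partially cleaned sample
pipeline.wait()
final = pipeline.query(count_distinct)   # all three rows cleaned
```

The early answer depends on timing and may be computed on a partially cleaned sample; once the pipeline finishes, the query sees two distinct affiliations across three cleaned rows.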

Great, Now What?

•  Prototype implementation complete!
•  Significant research challenges remain:
    –  Crowd worker performance and quality
    –  Pipeline semantics and optimization
    –  Programming model and interface
•  Open source release targeted for next year

Summary

•  Data Cleaning is slow, costly, and domain-specific
•  SampleClean brings data cleaning into the BDAS stack
•  SampleClean uses asynchrony to hide latency, and sampling to hide scale
•  SampleClean combines Algorithms, Machines, and People, all in one system

Asynchrony in Spark

•  The Spark abstraction: blocking BSP
•  So how do we achieve asynchrony?
    –  Multithreaded master
    –  Intermediate results materialized in Hive
    –  Standalone Finagle HTTP server for crowd work
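The "multithreaded master" idea can be illustrated without Spark: each blocking job is submitted from its own thread, and its output is written to a shared store so that later stages or queries can pick it up. In this sketch `blocking_stage` is a hypothetical stand-in for a blocking job and the `materialized` dictionary stands in for tables of intermediate results; no Spark, Hive, or Finagle API is used.

```python
import time
from concurrent.futures import ThreadPoolExecutor

materialized = {}  # stand-in for a store of intermediate results

def blocking_stage(name, rows, transform):
    """Hypothetical stand-in for a blocking, BSP-style job."""
    time.sleep(0.01)  # pretend the job takes a while
    materialized[name] = [transform(r) for r in rows]
    return name

rows = [" UC Berkeley ", "Stanford "]

with ThreadPoolExecutor(max_workers=2) as master:
    # Each call blocks only its own worker thread, so both jobs overlap
    # and the submitting thread stays free until it asks for results.
    futures = [
        master.submit(blocking_stage, "trimmed", rows, str.strip),
        master.submit(blocking_stage, "lowered", rows, str.lower),
    ]
    finished = sorted(f.result() for f in futures)
```

Writing each stage's output to a named entry is what lets a query read the latest completed result while other stages are still running.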
