Scaling Big Data Cleansing
PhD Defense of: Zuhair Khayyat
May 2017
What is Data Cleansing?
• Data cleansing is the process of:
  A. detecting errors in record sets, tables, or databases (violation detection)
  B. and fixing them (violation repair)
• Example errors in data:
  - Typos
  - Duplicates
  - Values inconsistent with business rules
  - Outliers
  - Outdated values
  - Missing values
May 16, 2017
Why Is Data Cleansing Important?
• 25% of the world's critical data is dirty
• 60% - 98% of a data scientist's time is lost in the process of data cleansing
• “Duplicate and dirty data costs the healthcare industry over $300 billion every year” -- Joe Fusaro (RingLead)
• “Inaccurate data has a direct impact ... the average company losing 12% of its revenue” -- Ben Davis (Econsultancy)
Example of a Dirty Dataset
A company employee database:
• Rule 1: Any two employees in the same Zipcode must be in the same City
• Rule 2: An employee who earns a higher salary must pay more taxes compared to others
     Name    Zipcode  City  State  Salary  Rate
t1   Annie   10001    NY    NY     24000   15
t2   Laure   90210    LA    CA     25000   10
t3   John    60601    CH    IL     40000   25
t4   Mark    90210    SF    CA     88000   28
t5   Robert  60827    CH    IL     15000   15
t6   Mary    90210    LA    CA     81000   28
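The two rules can be checked directly on the six example tuples. The following is a minimal Python sketch rather than anything shown in the slides: the function names and the tuple layout are illustrative assumptions, and both checks are brute force with no attempt at scalability.

```python
from collections import defaultdict

# The six example tuples: (id, name, zipcode, city, state, salary, rate).
EMPLOYEES = [
    ("t1", "Annie", "10001", "NY", "NY", 24000, 15),
    ("t2", "Laure", "90210", "LA", "CA", 25000, 10),
    ("t3", "John", "60601", "CH", "IL", 40000, 25),
    ("t4", "Mark", "90210", "SF", "CA", 88000, 28),
    ("t5", "Robert", "60827", "CH", "IL", 15000, 15),
    ("t6", "Mary", "90210", "LA", "CA", 81000, 28),
]


def zipcodes_with_conflicting_cities(rows):
    """Rule 1: return {zipcode: cities} for zip codes mapped to more than one city."""
    cities_by_zip = defaultdict(set)
    for _tid, _name, zipcode, city, *_rest in rows:
        cities_by_zip[zipcode].add(city)
    return {z: sorted(c) for z, c in cities_by_zip.items() if len(c) > 1}


def salary_rate_conflicts(rows):
    """Rule 2: return (ti, tj) where ti earns less than tj but has a higher rate."""
    return [
        (a[0], b[0])
        for a in rows
        for b in rows
        if a[5] < b[5] and a[6] > b[6]
    ]


print(zipcodes_with_conflicting_cities(EMPLOYEES))  # {'90210': ['LA', 'SF']}
print(salary_rate_conflicts(EMPLOYEES))             # [('t1', 't2'), ('t5', 't2')]
```

On this dataset the only Rule 1 conflict is Zipcode 90210 (LA vs. SF), and Rule 2 is violated by the pairs (t1, t2) and (t5, t2).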
The Process of Data Cleansing
[Diagram: dirty input data and detection rules → 1st: Detect → 2nd: Analyze → 3rd: Repair → 4th: Update input data → clean data]
Why Is Dirty Data Still a Problem?
• Data is growing at a 40% compound annual rate
• Source: Oracle, 2012, https://goo.gl/uHd4uR
[Figure label: ≈ 15 zettabytes]
Problems of Big Data Cleansing
[Diagram: the data cleansing process (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update input data), annotated with "≈ 90% runtime" and "Most of research"]
[Chart: time (seconds) vs. violation percentage (1%, 5%, 10%, 50%); series: violation detection, data repair]
Problems of Big Data Cleansing
1. Violation detection becomes too expensive with big data:
   a. Enumerating all tuples is not possible
   b. Not feasible to implement a parallel version of each detection rule
   c. Serial repair algorithms cannot handle big errors
Problems of Big Data Cleansing
2. Complex error discovery rules based on inequality conditions are too expensive:
   Rule 2: An employee who earns a higher salary must pay more taxes compared to others
   → (ti.salary < tj.salary) AND (ti.tax > tj.tax)
Problems of Big Data Cleansing
3. Error graph (violation graph) is random, big and unpredictable:
   • Irregular structures
   • Skewed distributions
   • Unpredictable workload of algorithm
Problems & Solutions of Big Data Cleansing
Problems:
1. Violation detection becomes too expensive with big data
2. Complex error discovery rules based on inequality conditions are too expensive
3. Error graph (violation graph) is random, big and unpredictable
Solutions:
• BigDansing: develop a general-purpose scalable data cleansing system
• IEJoin: introduce a new join algorithm to enhance inequality joins
• Mizan: build a general graph system that adapts to various graph structures and algorithms
BigDansing
A System for Big Data Cleansing
Related Work?
NADEEF*
[Diagram labels: Declarative rules, UDF, DBMS]
✓ Easy-to-use
✓ Extensible
✓ Efficient
✗ Scalable (single machine)
* M. Dallachiesa, et al., “NADEEF: A Commodity Data Cleaning System,” in SIGMOD 2013
What Does Big Data Cleansing Require?
1. Scale detection
   • Declarative rules
     - Functional dependencies (FDs, CFDs)
     - Denial constraints (DCs)
   • User defined functions
2. Scale repairs
   • Handle serial repair algorithms
BigDansing – Scaling Violation Detection
[Diagram: functional dependencies, denial constraints, entity resolution and inclusion dependencies map to a domain specific language]
Domain specific language operators: Scope, Block, Iterate, Detect, GenFix
BigDansing – Input
[Diagram: declarative rules pass through the rule parser; UDFs are written against Scope, Block, Iterate, Detect and GenFix; both produce the violation detection plan (logical plan)]
BigDansing – Plan Conversion and Optimization
Logical plan → Physical plan → Execution plan
Rule 1 – Logical Plan
• Any two employees in the same Zipcode must be in the same City
• FD: Zipcode → City
Logical operators: Scope, Block, Iterate, Detect, GenFix
Logical plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
Rule 1 – Physical Plan
• Any two employees in the same Zipcode must be in the same City
• FD: Zipcode → City
Physical operators: PScope, PBlock, PIterate, PDetect, PGenFix, CoBlock, UCrossProduct
Logical plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
Physical plan: PScope → PBlock → PIterate → PDetect → PGenFix
Rule 1 – Execution Plan
• Any two employees in the same Zipcode must be in the same City
• FD: Zipcode → City
Logical plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
Physical plan: PScope → PBlock → PIterate → PDetect → PGenFix
Execution plan: Spark-PScope → Spark-PBlock → Spark-PIterate → Spark-PDetect → Spark-PGenFix
Rule 1 – Execution Example
Plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
Input:
     Name    Zipcode  City  State  Salary  Rate
t1   Annie   10001    NY    NY     24000   15
t2   Laure   90210    LA    CA     25000   10
t3   John    60601    CH    IL     40000   25
t4   Mark    90210    SF    CA     88000   28
t5   Robert  60827    CH    IL     15000   15
t6   Mary    90210    LA    CA     81000   28
1) Scope:
     Zipcode  City
t1   10001    NY
t2   90210    LA
t3   60601    CH
t4   90210    SF
t5   60827    CH
t6   90210    LA
2) Block:
     Zipcode  City
t1   10001    NY
t2   90210    LA
t4   90210    SF
t6   90210    LA
t3   60601    CH
t5   60827    CH
3) Iterate: (t2,t4), (t2,t6), (t4,t6)
4) Detect: (t2,t4), (t4,t6)
5) GenFix: t2[City] = t4[City], t4[City] = t6[City]
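The five steps of the Rule 1 example can be mimicked with plain Python functions. This is only a single-machine sketch under stated assumptions: the function names mirror the logical operators on the slide, but their signatures and the dictionary layout are hypothetical and are not BigDansing's actual API.

```python
from collections import defaultdict
from itertools import combinations

ROWS = {
    "t1": {"Name": "Annie", "Zipcode": "10001", "City": "NY"},
    "t2": {"Name": "Laure", "Zipcode": "90210", "City": "LA"},
    "t3": {"Name": "John", "Zipcode": "60601", "City": "CH"},
    "t4": {"Name": "Mark", "Zipcode": "90210", "City": "SF"},
    "t5": {"Name": "Robert", "Zipcode": "60827", "City": "CH"},
    "t6": {"Name": "Mary", "Zipcode": "90210", "City": "LA"},
}


def scope(rows, attrs):
    """Scope: keep only the attributes the rule needs."""
    return {tid: {a: row[a] for a in attrs} for tid, row in rows.items()}


def block(rows, key):
    """Block: group tuple ids that share the same value of the blocking key."""
    blocks = defaultdict(list)
    for tid, row in rows.items():
        blocks[row[key]].append(tid)
    return blocks


def iterate(blocks):
    """Iterate: enumerate candidate pairs inside each block."""
    for tids in blocks.values():
        yield from combinations(tids, 2)


def detect(rows, pairs, attr):
    """Detect: keep the pairs that disagree on attr."""
    return [(a, b) for a, b in pairs if rows[a][attr] != rows[b][attr]]


def gen_fix(violations, attr):
    """GenFix: describe a possible fix for each violation."""
    return [f"{a}[{attr}] = {b}[{attr}]" for a, b in violations]


scoped = scope(ROWS, ["Zipcode", "City"])
candidates = list(iterate(block(scoped, "Zipcode")))
violations = detect(scoped, candidates, "City")
fixes = gen_fix(violations, "City")
print(candidates)  # [('t2', 't4'), ('t2', 't6'), ('t4', 't6')]
print(fixes)       # ['t2[City] = t4[City]', 't4[City] = t6[City]']
```

Blocking on Zipcode is what keeps the candidate list small: only tuples inside the same block are ever paired.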
Rule 2 – Logical Plan
• An employee who earns a higher salary must pay more taxes compared to others
• DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate)
[Table: the employee dataset t1 to t6]
• For Annie, compare Salary with:
  - Laure: compare Rate, report a violation!
  - John: compare Rate
  - Mark: compare Rate
  - Robert
  - Mary: compare Rate
Rule 2 – Logical Plan
• An employee who earns a higher salary must pay more taxes compared to others
• DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate)
Logical operators: Scope, Block, Iterate, Detect, GenFix
Logical plan: Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix
Rule 2 – Physical Plan
• An employee who earns a higher salary must pay more taxes compared to others
• DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate)
Physical operators: PScope, PBlock, PIterate, PDetect, PGenFix, CoBlock, UCrossProduct
Logical plan: Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix
Physical plan: PScope → UCrossProduct → PDetect → PGenFix
Rule 2 – Execution Plan
• An employee who earns a higher salary must pay more taxes compared to others
• DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate)
Logical plan: Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix
Physical plan: PScope → UCrossProduct → PDetect → PGenFix
Execution plan: Spark-PScope → Spark-UCrossProduct → Spark-PDetect → Spark-PGenFix
Plan Optimizations – OCJoin
• DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate)
Stages: Range partitioning → Sorting → Pruning → Joining
[Diagram: Partition 1 to Partition n, based on Salary and based on Rate; after pruning, joins (⨝) run among Partition 2, Partition 3, Partition 5 and Partition 6]
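The four stages named on the OCJoin slide (range partitioning, sorting, pruning, joining) can be illustrated with a much simplified single-machine sketch. It is not the OCJoin implementation: the partitioning on Salary only, the min/max pruning test and all function names are assumptions made for illustration.

```python
from itertools import product

# (id, Salary, Rate) for the six example employees.
ROWS = [
    ("t1", 24000, 15), ("t2", 25000, 10), ("t3", 40000, 25),
    ("t4", 88000, 28), ("t5", 15000, 15), ("t6", 81000, 28),
]


def range_partition(rows, n_parts):
    """Range partitioning + sorting: order by Salary, cut into contiguous chunks."""
    if not rows:
        return []
    ordered = sorted(rows, key=lambda r: r[1])
    size = -(-len(ordered) // n_parts)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def can_contain_violation(left, right):
    """Pruning test: a pair (ti in left, tj in right) with
    ti.Salary < tj.Salary AND ti.Rate > tj.Rate can only exist if the
    min/max bounds of the two partitions allow it."""
    min_left_salary = min(r[1] for r in left)
    max_right_salary = max(r[1] for r in right)
    max_left_rate = max(r[2] for r in left)
    min_right_rate = min(r[2] for r in right)
    return min_left_salary < max_right_salary and max_left_rate > min_right_rate


def partitioned_inequality_join(rows, n_parts=3):
    """Joining: compare tuples only across partition pairs that survive pruning."""
    parts = range_partition(rows, n_parts)
    result, pruned = [], 0
    for left, right in product(parts, repeat=2):
        if not can_contain_violation(left, right):
            pruned += 1
            continue
        result.extend(
            (a[0], b[0]) for a in left for b in right
            if a[1] < b[1] and a[2] > b[2]
        )
    return sorted(result), pruned


pairs, pruned_partition_pairs = partitioned_inequality_join(ROWS)
print(pairs)                   # [('t1', 't2'), ('t5', 't2')]
print(pruned_partition_pairs)  # 7
```

With three partitions, seven of the nine partition pairs are discarded before any tuple-level comparison, which is the effect the pruning stage is after.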
Rule 2 – Execution Plan
• Rule 2: An employee who earns a higher salary must pay more taxes compared to others
• DC: ∀ ti, tj ∈ D, ¬(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate)
Logical plan: Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix
Physical plan: PScope → OCJoin(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → PDetect → PGenFix
Execution plan: Spark-PScope → Spark-OCJoin → Spark-PDetect → Spark-PGenFix
What Does Big Data Cleansing Require?
1. Scale detection
   • Declarative rules
     - Functional dependencies (FDs, CFDs)
     - Denial constraints (DCs)
   • User defined functions
2. Scale repairs
   • Handle serial repair algorithms
BigDansing – Structure of the Violation Graph
Rule 1: Zipcode → City
Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate)
[Table: the employee dataset t1 to t6]
• Rule 1: t2[City] = t4[City]
• Rule 2: t1[Salary] > t2[Salary] OR t1[Tax] < t2[Tax]
BigDansing – Structure of the Violation Graph
Rule 1: Zipcode → City
Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate)
[Graph: nodes t1, t5, t2, t4, t6; edge labels R1:City, R1:City, R2:Salary,Tax]
• Rule 1: t2[City] = t4[City]
• Rule 1: t4[City] = t6[City]
• Rule 2: t1[Salary] > t2[Salary] OR t1[Tax] < t2[Tax]
• Rule 2: t5[Salary] > t2[Salary] OR t5[Tax] < t2[Tax]
BigDansing – Data Repair as a Black Box
[Diagram: graph analysis splits the violation graph into sub-graphs ({t1, t5, t2} with R2:Salary,Tax and {t2, t4, t6} with R1:City); each sub-graph is passed to its own instance of the serial repair algorithm]
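One way to read the diagram is that the violation graph is split into connected components and an unmodified serial repair routine is run on each component independently. The sketch below follows that reading under two loudly stated assumptions: graph nodes are cells (tuple id, attribute), which is why the City violations and the Salary/Tax violations of t2 land in different components, and `repair` is a hypothetical placeholder for whatever serial algorithm is plugged in.

```python
from collections import defaultdict

# Each violation lists the cells (tuple id, attribute) involved in it.
# Assumption: nodes of the violation graph are cells, not whole tuples.
VIOLATIONS = [
    {("t2", "City"), ("t4", "City")},                                    # Rule 1
    {("t4", "City"), ("t6", "City")},                                    # Rule 1
    {("t1", "Salary"), ("t1", "Tax"), ("t2", "Salary"), ("t2", "Tax")},  # Rule 2
    {("t5", "Salary"), ("t5", "Tax"), ("t2", "Salary"), ("t2", "Tax")},  # Rule 2
]


def connected_components(violations):
    """Group cells that are linked, directly or transitively, by a violation."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for cells in violations:
        first, *rest = sorted(cells)
        find(first)
        for cell in rest:
            union(first, cell)

    groups = defaultdict(set)
    for cell in parent:
        groups[find(cell)].add(cell)
    return sorted(sorted(group) for group in groups.values())


def repair_components(components, repair):
    """Run the black-box repair function once per component.
    The calls are independent, so they could be dispatched in parallel."""
    return [repair(component) for component in components]


components = connected_components(VIOLATIONS)
for component in components:
    print(sorted({tid for tid, _attr in component}))
```

On the example this yields two components, one over t1, t2, t5 (Salary and Tax cells) and one over t2, t4, t6 (City cells), matching the two sub-graphs in the diagram.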
BigDansing – Apache Spark Stack
Performance of a Single Machine
[Chart, Rule 1: runtime (seconds) vs. dataset size (100,000 / 1,000,000 / 10,000,000 rows); systems: BigDansing, NADEEF, PostgreSQL, Spark SQL, Shark]
[Chart, Rule 2: runtime (seconds) vs. dataset size (100,000 / 200,000 / 300,000 rows); systems: BigDansing, NADEEF, PostgreSQL, Spark SQL, Shark]
Performance on a 16-machine Cluster
[Two charts, panels labeled Rule 1 and Rule 2]
[Chart: time (seconds) vs. dataset size (1M / 2M / 3M rows); systems: BigDansing-Spark, Spark SQL, Shark]
[Chart: time (seconds) vs. dataset size (10M / 20M / 40M rows); systems: BigDansing-Spark, BigDansing-Hadoop, Spark SQL, Shark]
Performance on a 16-machine Cluster
[Chart: runtime (seconds) vs. number of workers (1, 2, 4, 8, 16); systems: BigDansing, Spark SQL]
[Chart: time (seconds) vs. dataset size (647M / 959M / 1271M / 1583M / 1907M rows); systems: BigDansing-Spark, BigDansing-Hadoop, Spark SQL]
Detecting Violations on RDF
[Plan: Scope → (Block1 → Iterate1), (Block2 → Iterate2), (Block3 → Iterate3) → Detect → GenFix]
Detecting Violations on RDF
[Chart: runtime (seconds) of BigDansing vs. S2RDF* for 21M, 42M, 85M and 170M RDF triples; stacked bars: pre-processing, violation detection]
* Alexander Schätzle, et al., “S2RDF: RDF Querying with SPARQL on Spark”, in PVLDB 2016
BigDansing: A System for Big Data Cleansing
✓ Easy-to-use
✓ Efficient
✓ Extensible
✓ Scalable
* Zuhair Khayyat, et al., “BigDansing: A System for Big Data Cleansing”, in SIGMOD 2015.
IEJoin
Fast and Scalable Inequality Joins
OCJoin in BigDansing
[Chart: runtime (seconds) vs. data size (100,000 / 200,000 / 300,000 rows); series: OCJoin, UCrossProduct, Cross product]
[Chart: time (seconds) vs. dataset size (1M / 2M / 3M rows); systems: BigDansing-Spark, Spark SQL, Shark]
What is the Problem?
• Rule 2: ∀ t1, t2 ∈ D, ¬(t1.Salary < t2.Salary ∧ t1.Rate > t2.Rate)
  - SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Tax > t2.Tax
• Processed as a Cartesian product: O(n²)
Related Work
• Band join:
  - Based on a point within a range: R.A − c1 ≤ S.B & S.B ≤ R.A + c2
• Interval join in temporal and spatial data: not general
• Spatial indexing:
  - Large memory footprint
  - Expensive preprocessing
IEJoin – a New Join Algorithm
• In data cleansing:
  - Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary > t2.Salary AND t1.Tax < t2.Tax
• Interval intersection:
  - Q2: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≥ s.start
• Joining tables with (≠):
  - Qk: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≠ s.start
Algorithm Discovery
Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate
     Salary  Rate
t1   100     5
t2   90      9
t3   150     15
t4   120     10
Sort descending on Salary: t3(150) t4(120) t1(100) t2(90)
Salary partial answer: (t2,t1), (t2,t4), (t2,t3), ..., (t4,t3)
Sort descending on Rate: t3(15) t4(10) t2(9) t1(5)
Rate partial answer: (t1,t2), (t1,t4), (t1,t3), ..., (t4,t3)
Algorithm Discovery
Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate
     Salary  Rate
t1   100     5
t2   90      9
t3   150     15
t4   120     10
Rate partial answer:
{(t1,t2), (t1,t4), (t1,t3),
 (t2,t4), (t2,t3),
 (t4,t3)}
Salary partial answer:
{(t2,t1), (t2,t4), (t2,t3),
 (t1,t4), (t1,t3),
 (t4,t3)}
The expected result is: (t2,t1)
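The expected result can be double-checked by building one set of pairs per condition and intersecting them. This is only a brute-force sketch for verification. Unlike the slide, which lists the Rate pairs from the lower to the higher rate, both sets below are oriented the way Q1 states its conditions, so that a plain set intersection gives the answer.

```python
# (Salary, Rate) for the four tuples of the example.
ROWS = {"t1": (100, 5), "t2": (90, 9), "t3": (150, 15), "t4": (120, 10)}

# Pairs (a, b) satisfying the first condition: a.Salary < b.Salary.
salary_pairs = {(a, b) for a in ROWS for b in ROWS if ROWS[a][0] < ROWS[b][0]}

# Pairs (a, b) satisfying the second condition: a.Rate > b.Rate.
rate_pairs = {(a, b) for a in ROWS for b in ROWS if ROWS[a][1] > ROWS[b][1]}

# Q1 asks for the pairs that satisfy both conditions.
result = salary_pairs & rate_pairs
print(result)  # {('t2', 't1')}
```

Each partial answer holds six pairs, yet only one pair survives the intersection, which hints at why materialising the partial answers is wasteful.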
IEJoin – the Algorithm
     Salary  Rate
t1   100     5
t2   90      9
t3   150     15
t4   120     10
• Sort descending on Salary: t3(150) t4(120) t1(100) t2(90), positions 0 1 2 3
• Sort descending on Rate: t3(15) t4(10) t2(9) t1(5), permutation array 0 1 3 2
• Bit-array: starts as 0 0 0 0; bits are set to 1 as t3, t4, t2, t1 are visited
[Figure labels: sequential scan, random access]
Result = (t2,t1)
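The interplay of the two sorted lists, the permutation array and the bit-array can be sketched for the self-join Q1 (t1.Salary < t2.Salary AND t1.Rate > t2.Rate). This is a simplified illustration, not the published implementation: the function name is hypothetical, ties are handled by processing equal-Rate tuples as a group, and none of the optimizations from the later slides are included.

```python
from itertools import groupby


def ie_self_join(rows):
    """Return (permutation, pairs) where pairs holds (a, b) ids with
    a.Salary < b.Salary and a.Rate > b.Rate.
    rows maps a tuple id to (salary, rate)."""
    # L1: ids by Salary, descending. L2: ids by Rate, descending.
    l1 = sorted(rows, key=lambda t: rows[t][0], reverse=True)
    l2 = sorted(rows, key=lambda t: rows[t][1], reverse=True)
    position_in_l1 = {tid: i for i, tid in enumerate(l1)}
    # Permutation array: where each element of L2 sits in L1.
    permutation = [position_in_l1[tid] for tid in l2]

    n = len(l1)
    bits = [0] * n  # bit j is 1 once l1[j] was visited with a strictly higher Rate
    pairs = []
    index = 0
    for _rate, group in groupby(l2, key=lambda t: rows[t][1]):
        group = list(group)
        for offset, b in enumerate(group):
            pos = permutation[index + offset]  # random access into L1
            start = pos + 1
            # Skip entries with the same Salary: they are not strictly lower.
            while start < n and rows[l1[start]][0] == rows[b][0]:
                start += 1
            # Sequential scan: set bits to the right have a lower Salary.
            pairs.extend((l1[j], b) for j in range(start, n) if bits[j])
        # Mark the whole equal-Rate group only after scanning it.
        for offset in range(len(group)):
            bits[permutation[index + offset]] = 1
        index += len(group)
    return permutation, pairs


example = {"t1": (100, 5), "t2": (90, 9), "t3": (150, 15), "t4": (120, 10)}
permutation, pairs = ie_self_join(example)
print(permutation)  # [0, 1, 3, 2]
print(pairs)        # [('t2', 't1')]
```

The permutation array reproduces the "0 1 3 2" of the slide, and the only set bit found to the right of t1's position belongs to t2, giving the result (t2, t1).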
Sorting Orders
Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary (OP1) AND t1.Rate > t2.Rate (OP2)
• For self joins:
  - Salary: ascending order if OP1 is either > or ≥, otherwise descending order
  - Rate: descending order if OP2 is either > or ≥, otherwise ascending order
• For non-self joins:
  - Salary: descending order if OP1 is either > or ≥, otherwise ascending order
  - Rate: ascending order if OP2 is either > or ≥, otherwise descending order
Optimizations – Bitmap Index
[Figure: bit-array B = 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0, split into chunks C1 to C4 with bitmap index 0 1 0 0; markers: (i) pos 6, (ii) pos 9, max]
Optimizations – Not Equal Operator
• Convert each (≠) into one (>) and one (<) joined with the UNION ALL operator
Qk: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≠ s.start
Q'k: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end < s.start
     UNION ALL
     SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end > s.start
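The rewrite is safe because, for totally ordered non-null values, (<) and (>) are disjoint and together cover exactly (≠), so UNION ALL cannot introduce duplicates. A small Python check on a made-up Events table (the four events are hypothetical data, not from the slides):

```python
import operator

# Hypothetical events: (id, start, end).
EVENTS = [("e1", 1, 4), ("e2", 3, 6), ("e3", 4, 9), ("e4", 7, 8)]


def join(events, second_predicate):
    """Self-join on r.start <= s.end AND second_predicate(r.end, s.start)."""
    return [
        (r[0], s[0])
        for r in events
        for s in events
        if r[1] <= s[2] and second_predicate(r[2], s[1])
    ]


not_equal = join(EVENTS, operator.ne)                                # Qk
rewritten = join(EVENTS, operator.lt) + join(EVENTS, operator.gt)    # Q'k

print(sorted(not_equal) == sorted(rewritten))  # True
```

The pair (e1, e3) is absent from both results because e1 ends exactly where e3 starts, which is the one case that (≠) excludes.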
Optimizations – Selectivity Estimation
• A query with three attributes: r.Salary < s.Salary AND r.Rate > s.Rate AND r.Age > s.Age
• Use sampling to estimate the maximum output size: Est(Salary, Rate), Est(Salary, Age), Est(Rate, Age)
Stages: Range partitioning → Sorting → Pruning → Calculate max output
[Diagram: Partition 1 to Partition n, based on OP1 and based on OP2; estimated output = number of overlapping partitions = 2]
IEJoin and BigDansing
Serial IEJoin vs. Naïve Baseline
[Two charts, panels Salary-Rate and Interval Intersection: runtime (seconds, log scale) vs. input size (10K / 50K / 100K); systems: PG-IEJoin, PG-Original, MonetDB, DBMS-X]
Serial IEJoin vs. Postgres with Index – 50M Rows
[Chart: runtime (seconds) of PG-IEJoin, PG-GiST and PG-BTree for Q1 and Q2; stacked bars: indexing, querying; annotations: 16 workers, 1 worker]
GiST: Generalized Search Tree
Parallel and Distributed IEJoin – 100M Rows
[Two charts, panels Salary-Rate and Interval Intersection: runtime (seconds); systems: Parallel-IEJoin, Distributed-IEJoin, DPG-GiST, DPG-BTree, SparkSQL-SM, SparkSQL; stacked bars: indexing, querying]
IEJoin
• A new join algorithm
• Based on conditions: (<, ≤, >, ≥, ≠)
• Extremely fast and highly scalable
• Utilizes sorting and efficient data structures
• Easy to implement in traditional DBMS and distributed systems
* Zuhair Khayyat, et al., “Fast and Scalable Inequality Joins”, The VLDB Journal 2017, Special Issue: Best Papers of VLDB 2015
* Zuhair Khayyat, et al., “Lightning Fast and Space Efficient Inequality Joins”, in PVLDB 2015
Mizan
A System for Dynamic Load Balancing in Large-scale Graph Processing
BigDansing's Implementations
[Diagram: the data cleansing process (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update input data)]
[Stack: BigDansing on top of Apache Hadoop with Giraph and Apache Spark with GraphX, over HDFS]
Pregel*/Giraph Abstraction
• Based on vertex-centric computation
• Abstraction:
  - compute(), combine() & aggregate()
• In-memory bulk synchronous parallel (BSP) execution
[Diagram: Workers 1 to 3 run Superstep 1, Superstep 2 and Superstep 3, separated by BSP barriers]
* G. Malewicz, et al., “Pregel: A System for Large-Scale Graph Processing,” in SIGMOD 2010
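The vertex-centric BSP model can be simulated in a few lines on one machine. The sketch below is an illustration only: `run_bsp` and `min_label` are hypothetical names, only a compute() function is modelled (no combine() or aggregate()), and the example algorithm, propagating the smallest vertex id through each connected component, is chosen here for brevity rather than taken from the slides.

```python
def run_bsp(graph, compute, max_supersteps=30):
    """graph maps each vertex to its neighbours.
    Runs supersteps until no messages are in flight."""
    values = {v: v for v in graph}   # initial value: the vertex's own id
    inbox = {v: [] for v in graph}
    active = set(graph)              # every vertex is active in superstep 0
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = {v: [] for v in graph}
        for v in active:
            values[v], outgoing = compute(superstep, values[v], inbox[v], graph[v])
            for target, message in outgoing:
                outbox[target].append(message)
        # BSP barrier: messages sent now are only visible in the next superstep.
        inbox = outbox
        active = {v for v, messages in inbox.items() if messages}
        superstep += 1
    return values, superstep


def min_label(superstep, value, messages, neighbours):
    """compute(): adopt the smallest label seen; notify neighbours on change."""
    new_value = min([value, *messages])
    if superstep == 0 or new_value < value:
        return new_value, [(n, new_value) for n in neighbours]
    return new_value, []


GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
labels, _supersteps = run_bsp(GRAPH, min_label)
print(labels)  # {'a': 'a', 'b': 'a', 'c': 'a', 'd': 'd', 'e': 'd'}
```

A vertex that receives no messages simply stays inactive, which plays the role of voting to halt in this simplified loop.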
Problems of Giraph
The Light Side:
• Algorithm: predictable
• Structure: fixed
The Dark Side:
• Algorithm: unforeseen
• Structure: variable
Error graph (violation graph) is random, big and unpredictable
How Giraph Optimizes Computations
1. Faster graph loading
   • Simple graph partitioning
   • Hash, Range
2. Optimized for graph structure
   • Sophisticated and expensive partitioning techniques
   • Min-cuts
[Chart: run time (min) on the LiveJournal, kgraph4m68m and arabic-2005 graphs; series: Hash, Range, Min-cuts]
The runtime of a single iteration is as fast as the slowest worker
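The closing remark, that an iteration is only as fast as its slowest worker, can be made concrete with a toy model. Everything in the sketch is an assumption for illustration: the per-vertex costs are invented, hash partitioning is modelled as `id mod workers`, and range partitioning as equal-sized contiguous id ranges.

```python
def hash_partition(vertex_ids, n_workers):
    """Assign vertex v to worker v mod n_workers."""
    return {v: v % n_workers for v in vertex_ids}


def range_partition(vertex_ids, n_workers):
    """Assign contiguous id ranges of equal size to the workers."""
    ordered = sorted(vertex_ids)
    size = max(1, -(-len(ordered) // n_workers))  # ceiling division
    return {v: i // size for i, v in enumerate(ordered)}


def superstep_time(cost, assignment, n_workers):
    """With a BSP barrier, every worker waits for the most loaded one."""
    load = [0] * n_workers
    for vertex, worker in assignment.items():
        load[worker] += cost[vertex]
    return max(load), load


# Hypothetical per-vertex cost: three expensive vertices with adjacent ids.
COST = {0: 9, 1: 8, 2: 7, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1}

print(superstep_time(COST, range_partition(COST, 3), 3))  # (25, [25, 4, 4])
print(superstep_time(COST, hash_partition(COST, 3), 3))   # (12, [12, 11, 10])
```

The total work is the same in both cases (33 units); only the placement changes, and with it the time spent waiting at the barrier.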
Behaviors of Different Graph Algorithms
[Chart: incoming messages (millions, log scale) vs. supersteps (0 to 60); series: PageRank - Total, PageRank - Max/W, DMST - Total, DMST - Max/W]
PageRank vs. Distributed Minimal Spanning Tree
Source of Imbalance in Giraph
1. High vertex response time
2. Long time to receive incoming messages
3. Long time to send outgoing messages
[Three diagrams of Superstep 1 for Workers 1 to 3 up to the BSP barrier: high vertex response time; long time to receive in messages; long time to send out messages]
Mizan – Solving the Workload Imbalance
• Move vertices between workers during runtime
• Planning and vertex migrations within the BSP barrier to maintain computation consistency
[Diagram: Workers 1 to 3 across Superstep 1, Superstep 2 and Superstep 3; each BSP barrier is followed by the migration planner and a migration barrier]
[Diagram, Mizan worker: Vertex Compute(), BSP graph processor, load balancer (migration planner), communicator (DHT), storage manager, IO, HDFS/local disks]
Mizan's Migration Planning Steps
1. Identify the source of workload imbalance across workers
   • Remote outgoing messages
   • All incoming messages
   • Response time
[Diagram: vertices V1 to V6 on Worker 1 and Worker 2 (both running Mizan), showing remote incoming messages, remote outgoing messages, local incoming messages and vertex response time]
Mizan's Migration Planning Steps
1. Identify the source of workload imbalance across workers
2. Select the migration objective through a statistical analysis
   • Optimize for outgoing messages, or
   • Optimize for incoming messages, or
   • Optimize for response time
Mizan's Migration Planning Steps
1. Identify the source of workload imbalance across workers
2. Select the migration objective through a statistical analysis
3. Pair over-utilized workers with under-utilized ones
[Diagram: ordered list of workers W7, W2, W1, W5, W8, W4, W0, W6, W3 at positions 0 to 8, and W9]
Mizan's Migration Planning Steps
1. Identify the source of workload imbalance across workers
2. Select the migration objective through a statistical analysis
3. Pair over-utilized workers with under-utilized ones
4. Select vertices to migrate
   • Select the least number of vertices that have the highest impact
   • Vertex ownership: distributed hash table (DHT)
   • Delayed migration: reduce migration cost
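Steps 3 and 4 can be sketched as follows. This is a guess at the general shape, not Mizan's actual planner: the slides do not give the pairing or selection rules, so pairing the i-th most loaded worker with the i-th least loaded one and greedily moving vertices until about half the load gap is covered are both assumptions, as are the function names and the sample numbers.

```python
def pair_workers(load):
    """Step 3 (assumed rule): rank workers by the chosen metric and pair the
    most loaded with the least loaded, the second most with the second least,
    and so on."""
    ranked = sorted(load, key=load.get, reverse=True)
    return [
        (ranked[i], ranked[-1 - i])
        for i in range(len(ranked) // 2)
        if load[ranked[i]] > load[ranked[-1 - i]]
    ]


def select_vertices(vertex_cost, gap):
    """Step 4 (assumed rule): take the costliest vertices first, so that few
    vertices move, without shifting more than half of the load gap."""
    chosen, moved = [], 0
    ranked = sorted(vertex_cost.items(), key=lambda item: item[1], reverse=True)
    for vertex, cost in ranked:
        if moved + cost > gap / 2:
            continue
        chosen.append(vertex)
        moved += cost
    return chosen


# Hypothetical per-worker load for the selected migration objective.
LOAD = {"W0": 40, "W1": 10, "W2": 25, "W3": 15}
pairs = pair_workers(LOAD)
print(pairs)  # [('W0', 'W1'), ('W2', 'W3')]

# Hypothetical per-vertex cost on W0; the gap to its partner W1 is 30.
migrate = select_vertices({"v1": 12, "v2": 9, "v3": 5, "v4": 3}, LOAD["W0"] - LOAD["W1"])
print(migrate)  # ['v1', 'v4']
```

Capping the moved load at half the gap keeps the sketch from simply swapping the roles of the two workers.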
Performance of Mizan on PageRank
[Chart: runtime (min) of Static, WS and Mizan under Metis, Range and Hash partitioning]
Performance of Mizan with Metis
[Chart: runtime (min) for Advertisement and DMST; series: Static, Work Stealing, Mizan]
Mizan – a General Graph Processing System
• A Pregel clone
  - Supports very large graphs
  - Runs on very large clusters
• Dynamic fine-grained vertex migrations to balance computation and communication
• Optimized for predictable and non-predictable graph algorithms and structures
[Stack labels: BigDansing, Apache Spark, Mizan, GraphX, Giraph, HDFS]
* Zuhair Khayyat, et al., “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”, in EuroSys 2013
Summary
• A general system for big data cleansing
  - Performance up to 2 orders of magnitude faster
  - SIGMOD 2015
• A novel algorithm for fast inequality joins
  - Performance at least 2 orders of magnitude faster
  - PVLDB 2015 & VLDBJ 2017
• A general system for distributed graph processing
  - Performance improvements up to 84%
  - EuroSys 2013
Publications
• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Fast and Scalable Inequality Joins”, The VLDB Journal 2017, special issue: Best Papers of VLDB 2015.
• Divy Agrawal, Lamine Ba, Laure Berti-Equille, Sanjay Chawla, Ahmed Elmagarmid, Hossam Hammady, Yasser Idris, Zoi Kaoudi, Zuhair Khayyat, Sebastian Kruse, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Mohammed J. Zaki, “Rheem: Enabling Multi-Platform Task Execution”, in SIGMOD 2016.
• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Lightning Fast and Space Efficient Inequality Joins”, in PVLDB 2015.
• Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin, “BigDansing: A System for Big Data Cleansing”, in SIGMOD 2015.
• Zuhair Khayyat, Karim Awara, Amani Alonazi, Hani Jamjoom, Dan Williams, Panos Kalnis, “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”, in EuroSys 2013.