CS639: Data Management for Data Science

Lecture 25: EDA

Theodoros Rekatsinas

Data Visualizations Today

Standard Data Visualization Recipe:

1. Load dataset into data viz tool
2. Start with a desired hypothesis/pattern (explore combination of attributes)
3. Select viz to be generated
4. See if it matches desired pattern
5. Repeat 3-4 until you find a match

Tedious and Time-consuming!

Key Issue:

Visualizations can be generated by:
• varying subsets of data
• varying attributes being visualized

Too many visualizations to look at to find desired visual patterns!

1. Visualization recommendations

What you will learn about in this section

1. Space of Visualizations
2. Recommendation Metrics

Goal

Given a dataset and a task, automatically produce a set of visualizations that are the most “interesting” given the task

Particularly vague

Example

• Data analyst studying census data
  • age, education, marital-status, sex, race, income, hours-worked etc.
• A = # attributes in table
• Task: Compare on various socioeconomic indicators, unmarried adults vs. all adults

Space of visualizations

For simplicity, assume a single table (star schema)

Visualizations = agg. + grp. by queries

Vi = SELECT d, f(m) FROM table WHERE ___ GROUP BY d

(d, m, f): dimension, measure, aggregate

[Figure: example bar chart of an aggregate over the groups MA, CA, IL, NY]

Space of visualizations

Vi = SELECT d, f(m) FROM table WHERE ___ GROUP BY d

(d, m, f): dimension, measure, aggregate
{d}: race, work-type, sex etc.
{m}: capital-gain, capital-loss, hours-per-week
{f}: COUNT, SUM, AVG
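The space of candidate visualizations can be sketched by enumerating every (d, m, f) triple and filling in the query template. This is a minimal illustration: the attribute lists come from the slide, while the table name `census`, the helper `build_query`, and the default `1=1` filter are assumptions made for the example.

```python
from itertools import product

# Attribute lists copied from the slide; the table name "census" and the
# helper name build_query are assumptions made for this sketch.
DIMENSIONS = ["race", "work-type", "sex"]
MEASURES = ["capital-gain", "capital-loss", "hours-per-week"]
AGGREGATES = ["COUNT", "SUM", "AVG"]


def build_query(d, m, f, where="1=1", table="census"):
    """Fill in the template SELECT d, f(m) FROM table WHERE ... GROUP BY d."""
    # Double quotes make hyphenated column names valid SQL identifiers.
    return (
        f'SELECT "{d}", {f}("{m}") FROM {table} '
        f'WHERE {where} GROUP BY "{d}"'
    )


# Every (d, m, f) triple is one candidate visualization.
candidates = list(product(DIMENSIONS, MEASURES, AGGREGATES))
queries = [build_query(d, m, f) for d, m, f in candidates]

print(len(candidates))
print(queries[0])
```

Even with three options per slot there are already 3 * 3 * 3 = 27 candidate views, which is why the space grows quickly on real tables.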


Interesting visualizations

Deviation-based Utility

A visualization is interesting if it displays a large deviation from some reference

Task: compare unmarried adults with all adults

[Figure: two bar charts over the groups MA, CA, IL, NY, one for V1 (Target) and one for V2 (Reference)]

Target: V1 = SELECT d, f(m) FROM table WHERE target GROUP BY d
Reference: V2 = SELECT d, f(m) FROM table WHERE reference GROUP BY d

Compare induced probability distributions!

Deviation-based Utility Metric

A visualization is interesting if it displays a large deviation from some reference

Many metrics for computing distance between distributions

D[P(V1), P(V2)]

• Earth mover's distance
• L1, L2 distance
• K-L divergence

Any distance metric between distributions is OK!
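The metrics listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, the `eps` guard in the K-L divergence is an assumption, and the earth mover's distance assumes the groups sit at positions 0, 1, 2, ... on a line, which is one possible choice of ground distance for categorical groups.

```python
import math


def normalize(values):
    """Turn non-negative aggregate values into a probability distribution."""
    total = sum(values)
    return [v / total for v in values]


def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def l2_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kl_divergence(p, q, eps=1e-12):
    """K-L divergence D(p || q); eps guards against log(0) (an assumption)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def emd_1d(p, q):
    """Earth mover's distance with groups placed at positions 0, 1, 2, ...

    In this 1-D, unit-spacing case it equals the sum of absolute
    differences between the two cumulative distributions.
    """
    total, cum_p, cum_q = 0.0, 0.0, 0.0
    for a, b in zip(p, q):
        cum_p += a
        cum_q += b
        total += abs(cum_p - cum_q)
    return total


# Illustrative aggregate values for four groups (made up for this sketch).
p_target = normalize([50, 10, 10, 30])
p_reference = normalize([30, 20, 10, 40])

print(l1_distance(p_target, p_reference))
print(l2_distance(p_target, p_reference))
print(kl_divergence(p_target, p_reference))
print(emd_1d(p_target, p_reference))
```

Note that the K-L divergence is not symmetric, so the argument order matters for it, while L1, L2, and this EMD give the same value either way.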

Computing Expected Trend

Race vs. AVG(capital-gain)

Reference Trend:
SELECT race, AVG(capital-gain) FROM census GROUP BY race

P(V2): Expected Distribution

Computing Actual Trend

Race vs. AVG(capital-gain)

Target Trend:
SELECT race, AVG(capital-gain) FROM census WHERE marital-status = 'unmarried' GROUP BY race

P(V1): Actual Distribution

Computing Utility

U = D[P(V1), P(V2)]
D = EMD, L2 etc.
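The utility computation can be sketched end to end with an in-memory SQLite table. Everything concrete here is an assumption for illustration: the rows are invented toy data (not real census values), the helper `distribution` is hypothetical, and the L1 distance is just one admissible choice of D.

```python
import sqlite3

# Toy rows invented for illustration; they are not real census data.
ROWS = [
    ("white", "unmarried", 100.0),
    ("white", "married", 300.0),
    ("black", "unmarried", 200.0),
    ("black", "married", 200.0),
    ("asian", "unmarried", 300.0),
    ("asian", "married", 100.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE census (race TEXT, "marital-status" TEXT, "capital-gain" REAL)'
)
conn.executemany("INSERT INTO census VALUES (?, ?, ?)", ROWS)


def distribution(where):
    """Run SELECT race, AVG(capital-gain) ... GROUP BY race and normalize."""
    sql = (
        'SELECT race, AVG("capital-gain") FROM census '
        f"WHERE {where} GROUP BY race"
    )
    rows = conn.execute(sql).fetchall()
    total = sum(value for _, value in rows)
    return {race: value / total for race, value in rows}


# Target (actual) trend: unmarried adults. Reference (expected): all adults.
p_target = distribution("\"marital-status\" = 'unmarried'")
p_reference = distribution("1=1")

# Utility U = D[target, reference], with D chosen here as the L1 distance.
groups = set(p_target) | set(p_reference)
utility = sum(abs(p_target.get(g, 0.0) - p_reference.get(g, 0.0)) for g in groups)
print(round(utility, 4))
```

In this toy table the reference averages are identical across races while the target averages are not, so the view receives a non-zero utility.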

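Low Utility Visualization

[Figure: bar chart comparing the actual and expected distributions]

High Utility Visualization

[Figure: bar chart comparing the actual and expected distributions]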

Other metrics

• Data characteristics
• Task or Insight
• Semantics and Domain Knowledge
• Visual Ease of Understanding
• User Preference

2. DB-inspired Optimizations

What you will learn about in this section

1. Ranking Visualizations
2. Optimizations

Ranking

Across all (d, m, f), where
V1 = SELECT d, f(m) FROM table WHERE target GROUP BY d
V2 = SELECT d, f(m) FROM table WHERE reference GROUP BY d

Vi = (d: dimension, m: measure, f: aggregate)

10s of dimensions, 10s of measures, handful of aggregates

2*d*m*f
→ 100s of queries for a single user task!
→ Can be even larger. How?

Goal: return k best utility visualizations (d, m, f), (those with largest D[V1, V2])

Even larger space of queries

• Binning
• 3 dimensional or 4 dimensional visualizations
• Scatterplot or map visualizations
• …

Back to ranking

Across all (d, m, f), where
V1 = SELECT d, f(m) FROM table WHERE target GROUP BY d
V2 = SELECT d, f(m) FROM table WHERE reference GROUP BY d

Goal: return k best utility visualizations (d, m, f), (those with largest D[V1, V2])

Naïve Approach
For each (d, m, f) in sequence
    evaluate queries for V1 (target), V2 (reference)
    compute D[V1, V2]
Return the k (d, m, f) with largest D values
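The naïve approach can be sketched as a loop over every (d, m, f) that scores each view and keeps the k best. This is a toy sketch under stated assumptions: the rows and column names are invented, the group-by queries are emulated in plain Python rather than sent to a DBMS, and D is taken to be the L1 distance.

```python
import heapq
from itertools import product
from statistics import mean

# Toy rows invented for illustration; all column names are assumptions.
ROWS = [
    {"sex": "F", "region": "east", "married": False, "hours": 40, "gain": 5},
    {"sex": "M", "region": "east", "married": True, "hours": 40, "gain": 20},
    {"sex": "F", "region": "west", "married": True, "hours": 30, "gain": 10},
    {"sex": "M", "region": "west", "married": False, "hours": 50, "gain": 1},
]
DIMENSIONS = ["sex", "region"]
MEASURES = ["hours", "gain"]
AGGREGATES = {"SUM": sum, "AVG": mean}


def view(rows, d, m, f):
    """Emulate SELECT d, f(m) ... GROUP BY d and normalize to a distribution."""
    groups = {}
    for row in rows:
        groups.setdefault(row[d], []).append(row[m])
    values = {g: AGGREGATES[f](vals) for g, vals in groups.items()}
    total = sum(values.values())
    return {g: v / total for g, v in values.items()}


def utility(d, m, f):
    """D[V1, V2], with D chosen here as the L1 distance."""
    target = view([r for r in ROWS if not r["married"]], d, m, f)  # V1
    reference = view(ROWS, d, m, f)                                # V2
    keys = set(target) | set(reference)
    return sum(abs(target.get(g, 0.0) - reference.get(g, 0.0)) for g in keys)


def naive_top_k(k):
    # One independent evaluation per (d, m, f); nothing is shared or pruned.
    candidates = product(DIMENSIONS, MEASURES, AGGREGATES)
    return heapq.nlargest(k, candidates, key=lambda c: utility(*c))


top = naive_top_k(2)
for d, m, f in top:
    print(d, m, f, round(utility(d, m, f), 4))
```

Every candidate triggers its own pair of scans over the data, which is the cost that motivates looking for something better than this loop.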

Issues with Naïve Approach

• Repeated processing of same data in sequence across queries → Sharing
• Computation wasted on low-utility visualizations → Pruning

Optimizations

• Each visualization = 2 SQL queries
• Latency > 100 s
• Minimize number of queries and scans

[Figure: latency (s), axis from 0 to 500, vs. number of views (50, 100, 250), by dbms (COL, ROW)]

Optimizations

• Combine aggregate queries on target and ref
• Combine multiple aggregates
  (d1, m1, f1), (d1, m2, f1) → (d1, [m1, m2], f1)
• Combine multiple group-bys*
  (d1, m1, f1), (d2, m1, f1) → ([d1, d2], m1, f1)
  *Could be problematic…
• Parallel Query Execution
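The first two sharing ideas in the list above can be sketched as a single SQL query that computes target and reference aggregates for two measures in one scan. The rewriting shown here (conditional aggregation with CASE) is one possible way to combine the queries and is an assumption, not necessarily the rewriting the lecture has in mind; the rows are invented toy data.

```python
import sqlite3

# Toy rows invented for illustration; they are not real census data.
ROWS = [
    ("white", "unmarried", 100.0, 40.0),
    ("white", "married", 300.0, 50.0),
    ("black", "unmarried", 200.0, 30.0),
    ("black", "married", 200.0, 40.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE census (race TEXT, "marital-status" TEXT, '
    '"capital-gain" REAL, "hours-per-week" REAL)'
)
conn.executemany("INSERT INTO census VALUES (?, ?, ?, ?)", ROWS)

TARGET = "\"marital-status\" = 'unmarried'"

# One scan: target and reference versions of two measures, same group-by.
# CASE without ELSE yields NULL, and AVG ignores NULLs, so the first and
# third columns only average over target rows.
shared_sql = f"""
SELECT race,
       AVG(CASE WHEN {TARGET} THEN "capital-gain" END),
       AVG("capital-gain"),
       AVG(CASE WHEN {TARGET} THEN "hours-per-week" END),
       AVG("hours-per-week")
FROM census GROUP BY race ORDER BY race
"""
shared = conn.execute(shared_sql).fetchall()

# The unshared equivalent of the first column, for comparison.
separate = conn.execute(
    f'SELECT race, AVG("capital-gain") FROM census '
    f"WHERE {TARGET} GROUP BY race ORDER BY race"
).fetchall()

print(shared)
print(separate)
```

Four separate queries (two measures, each for target and reference) collapse into one, at the price of a wider result row per group.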

Combining Multiple Group-by's

• Too few group-bys per query lead to many table scans
• Too many group-bys per query hurt performance
  • # groups = Π (# distinct values per attribute)
• Optimal group-by combination ≈ bin-packing
  • Bin volume = log S (S = max number of groups)
  • Volume of items (attributes) = log(|ai|)
  • Minimize # bins s.t. Σi log(|ai|) <= log S within each bin
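The bin-packing view can be sketched with the first-fit decreasing heuristic, which is a common approximation and not an optimal solver. The function name `pack_group_bys`, the distinct-value counts, and the budget S = 100 are all assumptions chosen for illustration.

```python
import math


def pack_group_bys(distinct_counts, max_groups):
    """Group attributes into combined group-bys with first-fit decreasing.

    Each attribute has volume log(|a_i|); each bin has capacity log(S),
    so the product of distinct counts in a bin stays at or below S.
    An attribute with more than S distinct values gets a bin of its own.
    """
    capacity = math.log(max_groups)
    # Largest attributes first, as in first-fit decreasing.
    items = sorted(distinct_counts.items(), key=lambda kv: kv[1], reverse=True)
    bins = []
    for attr, count in items:
        volume = math.log(count)
        for b in bins:
            if volume <= b["free"] + 1e-12:  # small tolerance for rounding
                b["attrs"].append(attr)
                b["free"] -= volume
                break
        else:
            # No existing bin has room: open a new one.
            bins.append({"free": capacity - volume, "attrs": [attr]})
    return [b["attrs"] for b in bins]


# Hypothetical distinct-value counts per dimension attribute.
counts = {"race": 5, "sex": 2, "work-type": 8, "education": 16}
packing = pack_group_bys(counts, max_groups=100)
print(packing)
```

With these numbers the four single-attribute group-bys are packed into two combined queries, each producing at most 100 groups.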

Pruning optimizations

• Keep running estimates of utility
• Prune visualizations based on estimates
• Two flavors
  • Vanilla Confidence Interval based Pruning
  • Multi-armed Bandit Pruning

Discard low-utility views early to avoid wasted computation
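Confidence-interval pruning can be sketched as follows: process the data in phases, keep a running mean utility per view, and drop any view whose upper bound falls below the k-th best lower bound. This is a rough sketch under strong assumptions: utilities are taken to lie in [0, 1], the interval is a plain Hoeffding-style bound, and the function name `ci_prune` and the per-phase estimates are hypothetical. The slides only name the technique, so the details here should not be read as the lecture's exact method.

```python
import math


def ci_prune(estimates_per_phase, k, delta=0.05):
    """Return (top_k, active) after confidence-interval pruning.

    estimates_per_phase: list of dicts mapping view -> utility estimate
    in [0, 1] computed on one phase (chunk) of the data.
    """
    active = set(estimates_per_phase[0])
    sums = {v: 0.0 for v in active}
    n = 0
    for n, phase in enumerate(estimates_per_phase, start=1):
        for v in active:
            sums[v] += phase[v]  # pruned views are no longer updated
        # Hoeffding-style half width after n phases (an assumption).
        half_width = math.sqrt(math.log(2 / delta) / (2 * n))
        means = {v: sums[v] / n for v in active}
        if len(active) > k:
            lower_bounds = sorted((means[v] - half_width for v in active), reverse=True)
            threshold = lower_bounds[k - 1]  # k-th best lower bound
            # Keep a view only if its upper bound can still reach the top k.
            active = {v for v in active if means[v] + half_width >= threshold}
    top_k = sorted(active, key=lambda v: sums[v] / n, reverse=True)[:k]
    return top_k, active


# Hypothetical per-phase utility estimates for three views.
phases = [{"A": 0.9, "B": 0.5, "C": 0.1} for _ in range(20)]
top_k, active = ci_prune(phases, k=1)
print(top_k, sorted(active))
```

In this run view C is discarded partway through, while B is still too close to A to rule out after 20 phases; only the best view is returned.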

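[Figure: middleware layer between the visualizations and the DBMS; visualizations generate queries (100s), which pass through an optimizer (sharing, pruning) before reaching the DBMS]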

More on automated visualizations

ZQL: a viz exploration language

Intelligent query optimizer

Summary

Human-in-the-loop analytics are here to stay!