A Robust Partitioning Scheme for Ad-Hoc Query Workloads

Anil Shanbhag (MIT)

Joint work with Alekh Jindal (Microsoft), Sam Madden (MIT), Jorge Quiane (QCRI), Aaron J. Elmore (Univ. Chicago)

Today

Data collection is cheap => lots of data!

Data Partitioning

Find average order size for all orders between Sept 10 and Sept 11, 2017

Data Skipping - skip data blocks that are not necessary

10% selectivity query => 10x faster if the data is partitioned on the selection predicate

Order date
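The data-skipping idea above can be sketched in a few lines. This is a toy illustration, not the system's code: the `Block` class, its min/max metadata, and the one-block-per-day layout are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Block:
    rows: list  # (order_date, order_size) pairs stored in this block

    def __post_init__(self):
        # Per-block metadata: min and max of the partitioning attribute.
        dates = [d for d, _ in self.rows]
        self.min_date, self.max_date = min(dates), max(dates)


def average_order_size(blocks, lo, hi):
    """Average order size for lo <= order_date <= hi.

    Returns (average, number_of_blocks_scanned).
    """
    total, count, scanned = 0.0, 0, 0
    for block in blocks:
        # Data skipping: the metadata proves no row in this block can match.
        if block.max_date < lo or block.min_date > hi:
            continue
        scanned += 1
        for order_date, size in block.rows:
            if lo <= order_date <= hi:
                total += size
                count += 1
    return (total / count if count else None), scanned


# Toy layout (assumed): 20 blocks, one per day of September 1-20, 2017.
blocks = [
    Block([(date(2017, 9, day), float(day)), (date(2017, 9, day), float(day + 2))])
    for day in range(1, 21)
]
avg, scanned = average_order_size(blocks, date(2017, 9, 10), date(2017, 9, 11))
print(avg, scanned, len(blocks))
```

With this layout the query touches 2 of the 20 blocks, which is the 10% selectivity case from the slide.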

The Problem

Analytics: Ad-Hoc/Exploratory Analysis + Recurring Workloads

Focus of existing work: give workload => return partitioning layout

Problems:
1. Tedious to collect the workload
2. May not be known upfront
3. Changes over time

How to get the benefits of partitioning in this case?

Our Approach

Do everything adaptively!

Two-step process:
1. Upfront, load the dataset partitioned
2. As users query, incrementally improve the partitioning of the data

Upfront Partitioning

In distributed storage systems like HDFS, files are broken into blocks (128 MB chunks)

A <= 5 and B <= 7

> Instead of partitioning by size, partition by attributes.
> Same number of blocks created as in HDFS. Each block now has additional metadata
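A minimal sketch of partitioning by attributes, assuming a small binary tree of cuts whose leaves are the blocks. The `Node` and `Leaf` classes, the tree shape, and the sample rows are hypothetical; only the predicates A <= 5 and B <= 7 come from the slide. In a real load the number of leaves would be chosen to match the number of HDFS blocks.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    block_id: int
    rows: list = field(default_factory=list)
    metadata: list = field(default_factory=list)  # predicates on the path to this block


@dataclass
class Node:
    attr: str
    cut: int
    left: object   # subtree for rows with row[attr] <= cut
    right: object  # subtree for rows with row[attr] > cut


def label(tree, path=()):
    """Attach the root-to-leaf predicates to every block as metadata."""
    if isinstance(tree, Leaf):
        tree.metadata = list(path)
        return [tree]
    left = label(tree.left, path + (f"{tree.attr} <= {tree.cut}",))
    right = label(tree.right, path + (f"{tree.attr} > {tree.cut}",))
    return left + right


def load(tree, rows):
    """Route each row to the block whose predicates it satisfies."""
    for row in rows:
        node = tree
        while isinstance(node, Node):
            node = node.left if row[node.attr] <= node.cut else node.right
        node.rows.append(row)


# Hypothetical tree: cut on A at 5, then cut the A <= 5 side on B at 7.
tree = Node("A", 5, Node("B", 7, Leaf(0), Leaf(1)), Leaf(2))
leaves = label(tree)
load(tree, [{"A": 1, "B": 3}, {"A": 4, "B": 9}, {"A": 8, "B": 2}, {"A": 5, "B": 7}])
for leaf in leaves:
    print(leaf.block_id, " and ".join(leaf.metadata), len(leaf.rows))
```

Block 0 ends up with the metadata "A <= 5 and B <= 7", which a later query can test without opening the block.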

Adaptive Re-Partitioning

When a user submits a query, the optimizer tries to improve the partitioning by reorganizing the partitioning tree

Here, if queries ask A <= 3 many times, replace node B7 by A3

Done on datasets which are O(1 TB) with ~8000-node partition trees.
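To see why replacing B7 by A3 helps a query on A <= 3, the sketch below counts the blocks a query must read before and after the change. The tuple encoding of the tree, the block names, and the tree shape are assumptions for illustration; the actual repartitioning would also have to rewrite the affected blocks.

```python
def blocks_to_read(tree, attr, upper):
    """Blocks that may hold rows satisfying `attr <= upper`.

    A tree is either a block name (str) or a tuple (attr, cut, left, right),
    where `left` holds rows with value <= cut and `right` holds the rest.
    """
    if isinstance(tree, str):
        return [tree]
    node_attr, cut, left, right = tree
    if node_attr != attr:
        # This node cuts on another attribute, so it cannot prune anything.
        return blocks_to_read(left, attr, upper) + blocks_to_read(right, attr, upper)
    result = blocks_to_read(left, attr, upper)
    if upper > cut:
        # Rows with value > cut can still satisfy the predicate.
        result += blocks_to_read(right, attr, upper)
    return result


# Hypothetical trees: node B7 (cut on B at 7) is replaced by A3 (cut on A at 3).
before = ("A", 5, ("B", 7, "block0", "block1"), "block2")
after = ("A", 5, ("A", 3, "block0_new", "block1_new"), "block2")

print(blocks_to_read(before, "A", 3))
print(blocks_to_read(after, "A", 3))
```

In this toy tree the query reads two blocks before the change and one block after it.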

System Architecture

Predicated Scan Query Example:

FIND employees WITH Age < 30 AND 20k < Salary < 40k
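One way a predicated scan like the example above could be answered against a partitioning tree is sketched here: first prune blocks using the tree, then filter rows inside the remaining blocks. The tree, block contents, and function names are hypothetical and are not taken from the system.

```python
INF = float("inf")


def lookup(tree, query):
    """Blocks that may contain matches.

    `tree` is a block name (str) or (attr, cut, left, right); `query` maps an
    attribute to an open interval (lo, hi), meaning lo < value < hi.
    """
    if isinstance(tree, str):
        return [tree]
    attr, cut, left, right = tree
    lo, hi = query.get(attr, (-INF, INF))
    found = []
    if lo < cut:   # left subtree holds values <= cut
        found += lookup(left, query)
    if hi > cut:   # right subtree holds values > cut
        found += lookup(right, query)
    return found


def scan(blocks, tree, query):
    """Read only the candidate blocks and filter their rows."""
    return [
        row
        for name in lookup(tree, query)
        for row in blocks[name]
        if all(lo < row[attr] < hi for attr, (lo, hi) in query.items())
    ]


# Hypothetical layout: cut on age at 30, then on salary at 40000.
tree = ("age", 30, ("salary", 40000, "b0", "b1"), "b2")
blocks = {
    "b0": [
        {"name": "ann", "age": 25, "salary": 30000},
        {"name": "bob", "age": 30, "salary": 35000},
        {"name": "cy", "age": 28, "salary": 15000},
    ],
    "b1": [{"name": "dee", "age": 22, "salary": 90000}],
    "b2": [{"name": "eli", "age": 45, "salary": 30000}],
}
# Age < 30 AND 20k < Salary < 40k
query = {"age": (-INF, 30), "salary": (20000, 40000)}
print(lookup(tree, query), [row["name"] for row in scan(blocks, tree, query)])
```

Pruning is conservative: a block that survives `lookup` may still contain non-matching rows, so the row-level filter is always applied.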

1. Upfront Partitioner

Goal: Generate a partitioning tree WITHOUT an upfront query workload

> Generates a tree with heterogeneous branching

> Balance the partitioning benefit across all attributes

Allocation

Goal: Balance the partitioning benefit across attributes

Allocation of attribute i ~ average partitioning of attribute i

$$\text{Allocation}_i = \sum_{j \in \text{all nodes}} n_{ij} \, c_{ij}$$

Upfront Partitioning Algorithm

Attribute Allocations

Partitioning Tree

Uniform if no workload information; weighted if we have prior workload information
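The slide does not define the two factors in the allocation sum, so the sketch below rests on an assumed reading: n_ij is the number of ways node j splits attribute i (0 if the node cuts on another attribute), and c_ij is the fraction of the dataset that reaches node j. The node lists are made up for the example.

```python
def allocations(nodes, attributes):
    """Sum n_ij * c_ij over all nodes j, for each attribute i.

    `nodes` is a list of (attr, ways, coverage) triples, one per tree node:
    `ways` plays the role of n_ij and `coverage` the role of c_ij
    (both meanings are assumptions, see the text above).
    """
    alloc = {attr: 0.0 for attr in attributes}
    for attr, ways, coverage in nodes:
        alloc[attr] += ways * coverage
    return alloc


# Hypothetical two-level binary tree: the root cuts on A over all the data,
# and both children cut on B, each over half of the data.
balanced = [("A", 2, 1.0), ("B", 2, 0.5), ("B", 2, 0.5)]

# Same root, but the right child cuts on A again instead of B.
skewed = [("A", 2, 1.0), ("B", 2, 0.5), ("A", 2, 0.5)]

print(allocations(balanced, ["A", "B"]))
print(allocations(skewed, ["A", "B"]))
```

Under this reading the first tree gives both attributes the same allocation, while the second favours A, which is the kind of imbalance the stated goal is meant to avoid.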

2. Adaptive Query Executor

Goal: Return matching tuples + check if the partitioning layout can be improved

Alternatives found via transformations on the partitioning tree:
1. Swap Rule
2. Pushup Rule
3. Rotate Rule

Getting a plan

Cost Model

The system maintains a window W of past queries

Compute Benefit and Repartitioning Cost for the best plan

Repartitioning ONLY happens when the reduction in the total cost of the query workload is greater than the re-partitioning cost.

This avoids constant re-partitioning due to random query sequences and bounds the worst-case impact.
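The decision rule above can be sketched as follows. The class name, the cost units (blocks read), the window size, and the example costs are all assumptions; the slide only states that repartitioning happens when the workload cost reduction over the window exceeds the repartitioning cost.

```python
from collections import deque


class RepartitionPolicy:
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # window W of past queries

    def observe(self, query):
        self.window.append(query)

    def should_repartition(self, cost_now, cost_new, repartition_cost):
        """True only if the saving over the window exceeds the one-off cost."""
        benefit = sum(cost_now(q) - cost_new(q) for q in self.window)
        return benefit > repartition_cost


# Hypothetical costs in blocks read, keyed by the attribute a query filters on.
current_layout = {"A": 2, "B": 1}
candidate_layout = {"A": 1, "B": 2}
REPARTITION_COST = 2  # blocks rewritten to switch layouts (assumed)


def decide(queries):
    policy = RepartitionPolicy(window_size=5)
    for q in queries:
        policy.observe(q)
    return policy.should_repartition(
        current_layout.__getitem__, candidate_layout.__getitem__, REPARTITION_COST
    )


print(decide(["A", "B", "A", "B"]))       # alternating queries
print(decide(["A", "A", "A", "B", "A"]))  # workload dominated by A
```

An alternating query sequence yields no net benefit in this example, so the layout stays put; only a sustained shift toward A clears the threshold.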

Performance

4 metrics:
1) Load time
2) Time taken by first query
3) Aggregate runtime over a workload
4) Incremental improvement with workload hints

Load Time

TPC-H: Scale Factor 200 + de-normalized. Data size: 1.4 TB

Loading performance: 1.38 times slower than HDFS

Load time scales almost linearly with data size and is independent of the number of columns

Time taken by first query

On average: 45% better than full scan, 20% better than k-d tree

Aggregate Workload Runtime

[Figure: time taken (in s) vs. query no. (0 to 200); panels: full scan, range, range2, Amoeba]

Workload: 200 queries generated from random initialization of 8 query templates of the TPC-H benchmark

full scan – Baseline

range – partitions on orderdate (1 per date): 1.88x better

range2 – partitions on orderdate (64), r_name (4), c_mktsegment (4), quantity (8): 3.48x better

Amoeba – 3.84x better than baseline

Workload Hints

[Figure: time taken (in s) vs. query no. (0 to 200); series: default, better init]

Better Init: starts with a custom allocation to mimic range2

6.67x better than full scan

Filtering ratio: default: 0.81, better init: 0.9

Conclusion

• Amoeba is a distributed storage system based on an adaptive data partitioning scheme
• Low loading overhead
• Improved first query performance
• Adapts to changes and significantly improves workload runtime
• Can exploit workload hints
• Allows analysts to get started right away and reap the benefits of partitioning without an upfront workload