Magdalena Balazinska
Department of Computer Science & Engineering, University of Washington
http://www.cs.washington.edu/people/faculty/magda
The Myria Big Data Management System and Cloud Service
Industry and Sciences Live in a Data-Driven World
Everyone today has a big data problem
– Whether it is a data lake, data swamp, or data stream
– Whether they call it big data, data science, data wrangling
Photo by Gary Bridgman / CC BY
Big Data Management & Analytics
Exciting and challenging requirements of scientific big data apps
Often generalize beyond campus
Telescope image:
1. Iterative data cleaning
2. Object extraction
3. Classification
Picture from Deep Lens Survey (DLS: Tyson)
MRI data:
1. Image processing
2. Denoising
3. Model fitting
Data from the Human Connectome project
N-body simulation data:
1. Data clustering to extract galaxies
2. Graph analytics to study galaxy evolution
Picture from D. H. Stalder et al. arXiv:1208.3444 [astro-ph.CO]
Screenshot from our MyMergerTree service
[Diagram: Big Data Management and Analytics, with the goals Efficient and Easy]
Goals of the Myria stack
• Advance state-of-the-art in big data systems
• Focus on efficiency and productivity
• Test on real applications and support real users
Deliverables:
• Built a new big data management & analytics system
• Deployed and operate Myria as a service
Myria has been developed and is operated by
• Database Group in the Computer Science & Engineering Department at UW
• UW eScience Institute
Co-PIs: Dan Suciu and Bill Howe
Myria Demo
Myria Cloud Service
Service available through project website
Analysis in the Browser with Myria
Declarative-imperative analysis with MyriaL/SQL/Python
Myria Operates Directly on Data in S3
For efficient processing, caches query results internally in cluster
MyriaL is Imperative + Declarative with Iterations
Myria Provides Details of Query Execution
Myria Service Includes Jupyter Notebook
Jupyter notebook available directly with Myria service
Myria Supports Python User-Defined Functions
Data from the Human Connectome project
MRI data analysis
Python UDFs enable running legacy code and complex analytics beyond SQL/MyriaL
Users Can Deploy Own Service
pip install myria-cluster
myria-cluster create [OPTIONS] CLUSTER_NAME
myria-cluster stop/start/destroy […]
Example Myria Applications
Neuroscience
Galaxy Simulations
Natural Language Processing
Picture from Leila Zilles
MyMergerTree screenshot
Data from the Human Connectome project
Environmental Flow Cytometry
[Figure: flow cytometry scatter plots (ps3.fcs…subset, P35-surf) with FSC and 692-40 red fluorescence on log axes from 10^0 to 10^4; labeled populations: Picoplankton, Nanoplankton, Ultraplankton, Phytoplankton, Prochlorococcus]
Bibliometrics
Myria Internals
Myria Polystore Stack
[Diagram, top to bottom:
Front ends: Browser, Python/Jupyter, Specialized Services (MyMergerTree)
RACO: Query Translation, Optimization, and Orchestration
Back ends: MyriaX (Parallel, Iterative, and Elastic Query Execution), MPI, SciDB, Graphs, NoSQL]
MyriaX Cloud Deployment
[Diagram: a Coordinator with a REST interface receives JSON query plans & API calls. Each Amazon EC2 instance runs Workers in YARN containers alongside an RDBMS. Storage: HDFS on Amazon EBS volumes and/or local storage, and Amazon S3.]
MyriaX Pipelines Query Execution
[Diagram: the query SELECT, JOIN, AGGREGATE running on Worker 1 and Worker 2. Each worker runs SCAN & SELECT over data in RDBMS, HDFS, or S3; SHUFFLE producers and SHUFFLE consumers then feed the JOIN and AGG operators across workers.]
High performance query execution with pipelining
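The plan on this slide can be sketched in a few lines of Python. This is an illustrative, single-process sketch and not MyriaX code: generators stand in for pipelined operators, in-memory lists stand in for the network buffers of the shuffle, and every name here (`scan_select`, `shuffle`, `hash_join`, `aggregate`, `run_query`, `NUM_WORKERS`) is an assumption made for the example.

```python
from collections import defaultdict

NUM_WORKERS = 2  # assumed cluster size for this sketch


def scan_select(rows, predicate):
    """SCAN & SELECT: stream the rows that pass the filter, one at a time."""
    for row in rows:
        if predicate(row):
            yield row


def shuffle(partitions, key_index):
    """SHUFFLE: route each tuple to the worker that owns hash(key).

    A real engine streams batches over the network; lists are a stand-in.
    """
    inboxes = [[] for _ in range(NUM_WORKERS)]
    for part in partitions:
        for row in part:
            inboxes[hash(row[key_index]) % NUM_WORKERS].append(row)
    return inboxes


def hash_join(left, right):
    """JOIN on the first column: build a table on left, probe with right."""
    table = defaultdict(list)
    for key, value in left:
        table[key].append(value)
    for key, other in right:
        for value in table[key]:
            yield (key, value, other)


def aggregate(rows):
    """AGG: count joined tuples per key."""
    counts = defaultdict(int)
    for key, _, _ in rows:
        counts[key] += 1
    return dict(counts)


def run_query(r_parts, s_parts):
    """SELECT on R, shuffle R and S on the join key, then JOIN and AGG per worker."""
    r = shuffle([scan_select(p, lambda t: t[1] > 0) for p in r_parts], 0)
    s = shuffle([scan_select(p, lambda t: True) for p in s_parts], 0)
    result = {}
    for worker in range(NUM_WORKERS):
        # Both inputs were shuffled on the join key, so each key lives on one worker.
        result.update(aggregate(hash_join(r[worker], s[worker])))
    return result
```

Because both inputs are partitioned on the join key, each worker can join and aggregate its own keys without coordinating with the others.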
Some of Myria’s Innovations
Details, papers, videos, and code: http://myria.cs.washington.edu
Areas: Myria Cloud Operation; Efficient Processing & Complex Analytics with MyriaX; Myria Federation
• Automatic Data Pipes (completed)
• Image Processing (in development)
• Perf. Debugging (in development)
• Cloud PSLAs (completed)
• Performance Guarantees (in development)
• Elastic Memory (in development)
• Efficient Multi-Join (completed)
• Iterative Queries (completed)
• Data Summaries (in development)
• Federated Analytics (in development)
Overview paper: The Myria Big Data Management and Analytics System and Cloud Service. Myria Team. CIDR’17 Conference
Efficient and Easy Iterative Analytics
Modern Applications Require Iterative Analytics
• Social network: connected components
• Astronomy: evolution of galaxies
Picture from D. H. Stalder et al. arXiv:1208.3444 [astro-ph.CO]
Existing Solutions are Not Satisfactory
• Synchronous iterations only – AsterixDB, HaLoop, Pregel, REX, Spark, PrIter, Glog, …
• Single-node – LogicBlox, DatalogFS, …
• No declarative language – Stratosphere, Naiad, Grace, GraphLab, …
• Specialized for graphs – GraphLab, Grace, …
• Not a data management system – SociaLite, …
• Theory on recursive queries – DatalogFS, …
Myria’s Approach
Full-stack solution for iterative processing
– Declarative language
  • A subset of Datalog-with-Aggregation
  • But we let users express computation in MyriaL (SQL-based)
– Scalable and easily implementable
  • Small extensions to existing shared-nothing systems
– Efficient iterative computation
  • Execution models and optimizations
  • Implementation and empirical evaluation using Myria
Asynchronous and Fault-Tolerant Recursive Datalog Evaluation in Shared-Nothing Engines. Jingjing Wang, Magdalena Balazinska, and Daniel Halperin. PVLDB 8(12): 1542-1553 (2015)
Myria’s Optimized Iterations Example
Declarative Query:
E = scan(jwang:cc:graph);
V = select distinct E.$0 from E;
do
    CC := [$0, MIN($1)] <-
        [from V emit V.$0 as x, V.$0 as y] +
        [from E, CC where E.$0 = CC.$0 emit E.$1, CC.$1];
until convergence;
store(CC, CC);
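To make the semantics of the query concrete, here is a minimal single-node Python sketch of the same fixpoint. It is not how Myria evaluates the query (Myria compiles it to a distributed plan); the function name `connected_components` and the in-memory edge list are assumptions for illustration only.

```python
def connected_components(edges):
    """Repeat CC := min label per vertex until no label changes.

    `edges` is a list of (src, dst) pairs. Like the query, the sketch only
    propagates labels from src to dst, so an undirected graph should list
    each edge in both directions.
    """
    # Base case: [from V emit V.$0 as x, V.$0 as y], where V = distinct E.$0
    base = {src: src for src, _ in edges}
    cc = dict(base)
    while True:
        new_cc = dict(base)
        # Recursive case: [from E, CC where E.$0 = CC.$0 emit E.$1, CC.$1]
        for src, dst in edges:
            label = cc[src]
            # [$0, MIN($1)]: keep the smallest label seen for each vertex
            if dst not in new_cc or label < new_cc[dst]:
                new_cc[dst] = label
        if new_cc == cc:  # until convergence
            return cc
        cc = new_cc
```

For example, `connected_components([(1, 2), (2, 1), (4, 5), (5, 4)])` labels vertices 1 and 2 with 1, and vertices 4 and 5 with 4.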
Compiled to a Distributed Query Plan
[Diagram: query plan with IDBController(CC), Scan(Edges), Join, and Scan(Edges) operators]
Multiple relations with recursive dependencies
Subset of positive Datalog with aggregation
• Details in the paper
Important Runtime Optimizations
Declarative Query (subset of Datalog with agg.)
Shared-Nothing Query Plan, In-Memory Processing
Execution choices: Synchronous or Asynchronous; Prioritize New Data or Prioritize Base Data
[Figure: two bar charts of time (seconds) vs. # workers (8 and 32): Galaxy evolution (y-axis 0 to 600) and Least Common Ancestor (y-axis 0 to 160)]
Important Runtime Optimizations
[Figure: bar chart of time (seconds, y-axis 0 to 2000) vs. # workers (8, 16, 32, 64) for Connected Components, comparing the Synchronous, Asynchronous, Prioritize New Data, and Prioritize Base Data settings]
Performance Comparison with Spark
[Figure: query time (seconds, y-axis 0 to 250) vs. # of workers (8, 16, 32, 64) for Spark (GraphX), Myria Sync, and Myria Async]
Iterative Processing Summary
• User specifies query declaratively
  – Subset of Datalog with aggregation
• Generate efficient, shared-nothing query plan
• Plan amenable to runtime optimizations
  – Synchronous vs asynchronous
  – Different processing priorities
• Optimizations significantly affect performance
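The difference between synchronous and asynchronous evaluation can be illustrated with a small single-threaded Python simulation of connected components. This is only a sketch of the two scheduling styles, not Myria's implementation: the names `cc_synchronous` and `cc_asynchronous` are hypothetical, every endpoint of an edge is assumed to start with its own label, and the new-data vs. base-data priorities from the slides are not modeled.

```python
from collections import deque


def cc_synchronous(edges):
    """Bulk rounds: each round reads only the labels of the previous round."""
    labels = {v: v for edge in edges for v in edge}
    while True:
        updated = dict(labels)
        for src, dst in edges:
            updated[dst] = min(updated[dst], labels[src])
        if updated == labels:  # a full round with no change: converged
            return labels
        labels = updated


def cc_asynchronous(edges):
    """Worklist: a changed label is pushed to neighbors right away."""
    labels = {v: v for edge in edges for v in edge}
    out = {}
    for src, dst in edges:
        out.setdefault(src, []).append(dst)
    pending = deque(labels)  # vertices whose label still has to be propagated
    while pending:
        src = pending.popleft()
        for dst in out.get(src, []):
            if labels[src] < labels[dst]:
                labels[dst] = labels[src]
                pending.append(dst)  # no barrier: propagate as soon as it changes
    return labels
```

Both functions reach the same fixpoint; they differ only in when an updated label becomes visible to other vertices, which is the degree of freedom the runtime optimizations exploit.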
Data Movement for Polystore Analytics
Polystore Motivation
N-Body Data Analysis with Myria
[Diagram: pipelines over the same Data: Preprocess, Cluster; Spark: Preprocess, Cluster; and Preprocess, Export, Import, Cluster]
Data Movement is Expensive
[Figure: bar chart of minutes (0 to 60) broken down into Preprocess, Transfer, and Cluster]
Need Efficient Data Movement
[Figure: bar chart of minutes (0 to 60) for File System vs. Socket transfer, broken down into Preprocess, Transfer, and Cluster]
Can we Generate Efficient Data Pipes Automatically?
Data Movement with PipeGen
Step 1: Generate new data pipe code
[Diagram: DBMS bytecode and unit tests go into PipeGen, which produces a PipeGen-enabled DBMS]
Step 2: Use the new data pipes
[Diagram: the user or optimizer submits one query to the source DBMS and one to the target DBMS; source workers 1..n and target workers 1..m are matched through a worker directory; steps are numbered [1] to [4]]
Query on the source DBMS:
    t = scan(data)
    x = distances(t,t)
    export(x,'db://Target')
Query on the target DBMS:
    x = import('db://Source')
    u = cluster(x)
Worker Directory:
    source.w1 → target.wm
    source.wn → target.w1
PipeGen: Data Pipe Generator for Hybrid Analytics. Brandon Haynes, Alvin Cheung, and Magdalena Balazinska. SOCC 2016.
Data Movement with PipeGen
PipeGen: Automatic data pipe generator
[Diagram: DBMS bytecode goes into PipeGen, which outputs a DBMS with an optimized data pipe]
IORedirect: I/O Redirector
– Identify File Open Expressions
– Inject Conditional Redirection
– Instrument Unit Tests
FormOpt: Format Optimizer (Data Pipe Type, Augmented Types)
– Instrument Unit Tests
– Data Flow Analysis
– Type Substitution
PipeVerify: Verification
Modifies bytecode of analytics engines
Enables parallel data transfer using efficient, binary Arrow format
PipeGen’s Performance
Performance Breakdown
[Figure: transfer time (minutes, y-axis 0 to 120) vs. # elements transferred (up to 1E9) for the transfer from Myria to Giraph; series: HDFS (R=3), HDFS (R=1), File System, IORedirect, All Optimizations, Manual Optimization]
Polystore Data Movement Summary
• Polystore analytics is useful
  – Enables the use of features in different systems
  – Can improve analytics performance
• But, need ability to move data efficiently
• PipeGen generates efficient data pipes
  – Move data in parallel and without touching disk
  – Move data using an efficient, binary format
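The two properties in the last bullet (no disk, binary format) can be illustrated with a small Python sketch. It is hand-written and only shows the idea: PipeGen generates such pipes automatically by rewriting DBMS bytecode and uses the Arrow format, whereas this sketch uses Python's `struct` module, a local socket pair, and an assumed two-column schema. The names `RECORD`, `send_records`, `receive_records`, and `transfer` are hypothetical.

```python
import socket
import struct
import threading

# Assumed schema for the sketch: (int64 id, float64 value), network byte order.
RECORD = struct.Struct("!qd")


def send_records(sock, records):
    """Source worker: stream records in a fixed-width binary layout, then close."""
    with sock:
        for rec_id, value in records:
            sock.sendall(RECORD.pack(rec_id, value))


def receive_records(sock):
    """Target worker: read fixed-width records until the sender closes the pipe."""
    records = []
    with sock, sock.makefile("rb") as stream:
        while True:
            chunk = stream.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break
            records.append(RECORD.unpack(chunk))
    return records


def transfer(records):
    """Move records between two in-process 'workers' without touching disk."""
    source, target = socket.socketpair()
    sender = threading.Thread(target=send_records, args=(source, records))
    sender.start()
    received = receive_records(target)
    sender.join()
    return received
```

Running one such pipe per pair of source and target workers gives the parallel transfer described on the slides; the text-serialization and file-system round trip of an export followed by an import is avoided.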
Cloud Service with Performance SLA
User Should Simply Upload Her Data
Or point to data in Amazon S3
Myria’s Personalized Service Level Agreements
Changing the Face of Database Cloud Services with Personalized Service Level Agreements. Jennifer Ortiz, Victor T. Almeida, and Magdalena Balazinska. CIDR 2015
Myria’s SLA generation
[Diagram: from Schema to PSLA through Workload Generation, Runtime Prediction, and Workload Compression into PSLA (Query Clustering, Template Generation, Cross-Tier Pruning)]
Myria’s PerfEnforce Subsystem
Cluster size changes during query session
PerfEnforce Demonstration: Data Analytics with Performance Guarantees. Jennifer Ortiz, Brendan Lee, and Magdalena Balazinska. SIGMOD 2016.
Conclusion
What Makes Myria Interesting?
• Highly expressive
  – MyriaL (RA + iterations), SQL, & Python
• Polystore with hybrid analytics
• High performance on variety of queries
  – Great for users
  – Pushes state-of-art
• Available as a service
  – Focus on low barrier to entry
  – And turning users into self-sufficient experts
  – Also focus on the service provider: Operate Myria
• Source code and more info (includes videos) http://myria.cs.washington.edu/
Acknowledgments
The Myria Team!
• Special thanks to: Victor Almeida, Tobin Baker, Alvin Cheung, Shumo Chu, Daniel Halperin, Brandon Haynes, Bill Howe, Shrainik Jain, Paris Koutris, Brendan Lee, Ryan Maas, Dominik Moritz, Jennifer Ortiz, Dan Suciu, Jingjing Wang, Andrew Whitaker, Shengliang Xu
Our science collaborators!!
Our sponsors!!!
• National Science Foundation, Moore & Sloan Foundations, Washington Research Foundation, eScience Institute, ISTC Big Data, Petrobras, EMC, Amazon, and Facebook