A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud
(Many slides are from the authors' presentation at CLOUD 2011)

Presenter: Guagndong Liu
Mar 13th, 2012 (original presentation: Dec 8th, 2011)

Outline
- Introduction
- A Motivating Example
- Problem Analysis
- Important Concepts and Cost Model of Datasets Storage in the Cloud
- A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud
- Evaluation and Simulation
Introduction
- Scientific applications are computation and data intensive: generated datasets are terabytes or even petabytes in size, and the computation is huge (e.g. scientific workflows).
- Intermediate data are important: they are reused or reanalysed, and shared between institutions.
- The core decision is regeneration vs. storing.
Introduction
- Cloud computing is a new way of deploying scientific applications, with a pay-as-you-go model.
- Storing strategy: which generated datasets should be stored?
- There is a tradeoff between cost and user preference, so a cost-effective strategy is needed.

A Motivating Example
- Parkes radio telescope and pulsar survey
- Pulsar searching workflow
A Motivating Example
- Current storage strategy: delete all the intermediate data, due to storage limitations.
- Some intermediate data should be stored; some need not be.
Problem Analysis
- Which datasets should be stored?
- Data challenge: data volumes double every year over the next decade and beyond [Szalay et al., Nature, 2006].
- Different strategies correspond to different costs.
- Scientific workflows are very complex and there are dependencies among datasets; furthermore, one scientist can no longer decide the storage status of a dataset alone.
- Data accessing delay matters: datasets should be stored based on the trade-off between computation cost and storage cost.
- A cost-effective datasets storage strategy is needed.
Important Concepts
- Data Dependency Graph (DDG)
- A classification of the application data: original data and generated data
- Data provenance: a kind of meta-data that records how data are generated
Important Concepts: Attributes of a Dataset in DDG
A dataset di in the DDG has the following attributes:
- xi ($): the generation cost of dataset di from its direct predecessors.
- yi ($/t): the cost of storing dataset di in the system per time unit.
- fi (Boolean): a flag denoting whether dataset di is stored or deleted in the system.
- vi (Hz): the usage frequency, which indicates how often di is used.
- provSeti: the set of stored provenances that are needed when regenerating dataset di.
- CostRi ($/t): di's cost rate, i.e. the average cost per time unit of di in the system.
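These attributes can be collected in a small record type. This is an illustrative sketch, not code from the paper; the `Dataset` class and `cost_rate` method names are my own, and the `regen_cost` argument stands in for the cost of regenerating di from its stored provenances (provSeti):

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    """One dataset d_i in the Data Dependency Graph (DDG)."""
    gen_cost: float    # x_i ($): generation cost from direct predecessors
    store_rate: float  # y_i ($/t): storage cost per time unit
    stored: bool       # f_i: True if d_i is kept in storage, False if deleted
    freq: float        # v_i (Hz): how often d_i is used per time unit

    def cost_rate(self, regen_cost: float) -> float:
        """CostR_i: the average cost per time unit of d_i in the system.

        A stored dataset only costs its storage rate; a deleted one costs
        its regeneration (from provSet_i) times its usage frequency.
        """
        return self.store_rate if self.stored else regen_cost * self.freq
```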
Cost Model of Datasets Storage in the Cloud
- Cost = C + S, where
  - C: total cost of computation resources
  - S: total cost of storage resources
Cost Model of Datasets Storage in the Cloud
- Total cost rate of a DDG: the sum of CostRi over all datasets di in the DDG, where S is the storage strategy of the DDG.
- For a DDG with n datasets, there are 2^n different storage strategies.
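The 2^n blow-up can be seen directly by enumerating strategies for a small DDG. A brute-force sketch (my own illustration, not the paper's algorithm; it assumes a purely linear DDG where a deleted dataset is regenerated from its nearest stored predecessor, so regeneration cost accumulates the x values of deleted datasets along the way):

```python
from itertools import product

def total_cost_rate(x, y, v, stored):
    """Total cost rate of a linear DDG d_1 -> ... -> d_n for one strategy.

    stored[i] is f_i. A stored dataset contributes y_i; a deleted one
    contributes its accumulated regeneration cost times v_i.
    """
    total, regen = 0.0, 0.0
    for xi, yi, vi, fi in zip(x, y, v, stored):
        if fi:
            total += yi
            regen = 0.0          # regeneration restarts at a stored dataset
        else:
            regen += xi          # regeneration cost accumulates x_i
            total += regen * vi
    return total

def minimum_cost_strategy(x, y, v):
    """Exhaustively check all 2^n strategies (exponential: baseline only)."""
    return min(product([False, True], repeat=len(x)),
               key=lambda s: total_cost_rate(x, y, v, s))
```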
CTT-SP Algorithm
Goal: find the minimum cost storage strategy for a DDG.
Philosophy of the algorithm:
- Construct a Cost Transitive Tournament (CTT) based on the DDG. In the CTT, the paths from the start to the end dataset have a one-to-one mapping to the storage strategies of the DDG.
- The length of each path equals the total cost rate of the corresponding storage strategy.
- The Shortest Path (SP) therefore represents the minimum cost storage strategy.
CTT-SP Algorithm: Example
The weights of the cost edges:
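For a linear DDG the CTT-SP idea can be sketched as a forward dynamic program: each cost edge (di, dj) assumes dj is the next stored dataset after di, and its weight is yj plus the regeneration cost rates of the deleted datasets in between. A minimal sketch (the function name and representation are mine; the paper's full algorithm also handles sub-branches of general DDGs):

```python
def ctt_sp(x, y, v):
    """CTT-SP on a linear DDG: build the Cost Transitive Tournament and
    take the shortest path from a virtual start ds to a virtual end de.

    Nodes 0..n+1: 0 = ds, 1..n = datasets d_1..d_n, n+1 = de.
    Returns (minimum total cost rate, storage flags f_1..f_n).
    """
    n = len(x)
    INF = float("inf")
    dist = [INF] * (n + 2)
    prev = [-1] * (n + 2)
    dist[0] = 0.0
    for i in range(n + 1):                  # edges only go forward: simple DP
        if dist[i] == INF:
            continue
        for j in range(i + 1, n + 2):
            w = y[j - 1] if j <= n else 0.0  # storing d_j costs y_j per time unit
            gen = 0.0
            for k in range(i + 1, j):        # deleted datasets between i and j
                gen += x[k - 1]              # regeneration accumulates x_k
                w += gen * v[k - 1]
            if dist[i] + w < dist[j]:
                dist[j] = dist[i] + w
                prev[j] = i
    stored = [False] * n                     # datasets on the path are stored
    node = prev[n + 1]
    while node > 0:
        stored[node - 1] = True
        node = prev[node]
    return dist[n + 1], stored
```

On the three-dataset example figure, the path ds -> d1 -> d3 -> de has length y1 + (x2·v2 + y3) + 0, matching the strategy that stores d1 and d3 and deletes d2.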
A Local-Optimization based Datasets Storage Strategy
Requirements of the storage strategy:
- Efficiency and scalability: the strategy is used at runtime in the cloud, the DDG may be large, and the strategy itself consumes computation resources.
- Reflect users' preferences and data accessing delay: users may want to store some datasets, and users may have a certain tolerance of data accessing delay.
A Local-Optimization based Datasets Storage Strategy
Two new attributes of the datasets in the DDG represent users' accessing-delay tolerance:
- Ti: a duration of time that denotes the users' tolerance of dataset di's accessing delay.
- λi: a parameter between 0 and 1 that denotes the users' cost-related tolerance of dataset di's accessing delay.
A Local-Optimization based Datasets Storage Strategy
Efficiency and scalability:
- A general DDG is very complex. The computation complexity of the CTT-SP algorithm is O(n^9), which is neither efficient nor scalable enough for large DDGs.
- Partition the large DDG into small linear segments.
- Apply the CTT-SP algorithm to the linear DDG segments in order to guarantee a localised optimum.
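One simple way to realise the partitioning step is to cut the DDG at every dataset with more than one predecessor or more than one successor (a partitioning-point dataset) and keep the maximal chains in between. This is a sketch of the idea, not the paper's exact procedure; `linear_segments` and the adjacency-dict representation are my own:

```python
def linear_segments(succ):
    """Partition a DDG (dict: dataset -> list of successor datasets) into
    maximal linear segments, cutting at every partitioning-point dataset."""
    pred = {n: [] for n in succ}
    for n, outs in succ.items():
        for m in outs:
            pred[m].append(n)

    def is_linear(n):  # at most one way in and one way out
        return len(pred[n]) <= 1 and len(succ[n]) <= 1

    segments, seen = [], set()
    for n in succ:
        # a segment head is a linear dataset whose predecessor (if any)
        # is a partitioning point, so the chain cannot extend backwards
        if not is_linear(n) or (pred[n] and is_linear(pred[n][0])):
            continue
        seg, cur = [], n
        while cur is not None and is_linear(cur) and cur not in seen:
            seg.append(cur)
            seen.add(cur)
            cur = succ[cur][0] if succ[cur] else None
        segments.append(seg)
    return segments
```

Each returned segment is a linear DDG on which CTT-SP runs in polynomial time, which is what makes the overall strategy a local (per-segment) optimisation rather than a global one.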
Evaluation
Randomly generated DDGs are used for simulation:
- Size: randomly distributed from 100 GB to 1 TB.
- Generation time: randomly distributed from 1 hour to 10 hours.
- Usage frequency: time between usages randomly distributed from 1 day to 10 days.
- Users' delay tolerance (Ti): randomly distributed from 10 hours to one day.
- Cost parameter (λi): randomly distributed from 0.7 to 1 for every dataset in the DDG.
The Amazon cloud services price model (EC2 + S3) is adopted:
- $0.15 per gigabyte per month for storage resources.
- $0.10 per CPU hour for computation resources.
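The simulation parameters above can be drawn as follows. This is a sketch under the stated distributions; the unit conversions (a 30-day month, hourly rates) and the dictionary keys are my assumptions:

```python
import random

# Amazon-style flat rates from the slides
STORE_PER_GB_MONTH = 0.15  # $ per GB per month (S3-like)
CPU_PER_HOUR = 0.10        # $ per CPU hour (EC2-like)

def random_dataset(rng=random):
    """Draw one dataset's simulation parameters from the stated ranges."""
    size_gb = rng.uniform(100, 1000)       # 100 GB .. 1 TB
    gen_hours = rng.uniform(1, 10)         # generation time, 1..10 hours
    days_between_use = rng.uniform(1, 10)  # one usage every 1..10 days
    return {
        "x": gen_hours * CPU_PER_HOUR,                  # x_i ($)
        "y": size_gb * STORE_PER_GB_MONTH / (30 * 24),  # y_i ($/hour)
        "v": 1 / (days_between_use * 24),               # v_i (uses/hour)
        "T": rng.uniform(10, 24),                       # T_i (hours)
        "lam": rng.uniform(0.7, 1.0),                   # λ_i
    }
```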
Evaluation
The proposed strategy is compared with different storage strategies:
- Usage based strategy
- Generation cost based strategy
- Cost rate based strategy
[Figures: simulation results comparing the strategies]
© 2007 The Board of Regents of the University of Nebraska. All rights reserved.
Thanks
[Figure: the pulsar searching workflow — raw beam data are extracted and compressed, de-dispersed, accelerated, and seeked; candidates are listed and recorded as XML files. Intermediate dataset sizes range from 1 KB to 90 GB, and generation times from 1 min to 790 mins.]
[Figure: an example DDG with eight datasets d1–d8]
[Figure: a linear DDG d1 -> d2 -> d3 with attributes (x1, y1, v1), (x2, y2, v2), (x3, y3, v3), and example storage strategies S1: f1=1, f2=0, f3=0; S2: f1=0, f2=0, f3=1; ...]
[Figure: the CTT built from the linear DDG d1 -> d2 -> d3, with virtual start ds and end de. Cost-edge weights: e(ds,d1)=y1; e(ds,d2)=x1v1+y2; e(ds,d3)=x1v1+(x1+x2)v2+y3; e(ds,de)=x1v1+(x1+x2)v2+(x1+x2+x3)v3; e(d1,d2)=y2; e(d1,d3)=x2v2+y3; e(d1,de)=x2v2+(x2+x3)v3; e(d2,d3)=y3; e(d2,de)=x3v3; e(d3,de)=0]
[Figure: a large DDG partitioned at partitioning-point datasets into linear segments Linear DDG1–Linear DDG4]