Lessons Learned: Building Scalable Applications with the Windows Azure Platform

  • Published on
    24-Feb-2016

DESCRIPTION

SVC32. Lessons Learned: Building Scalable Applications with the Windows Azure Platform. Simon Davies, Windows Azure TSP, Microsoft Corporation. Agenda: objectives of this session; thoughts on scalability in the cloud; real-world lessons learned (Thuzi, RiskMetrics); summary.

Transcript

SVC32: Lessons Learned: Building Scalable Applications with the Windows Azure Platform

Lessons Learned: Building Scalable Applications with the Windows Azure Platform
Simon Davies
Windows Azure TSP
Microsoft Corporation
SVC32

11/18/2009 9:12 AM
© 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Agenda
  • Objectives of this session
  • Thoughts on scalability in the cloud
  • Real-world lessons learned
      • Thuzi
      • RiskMetrics
  • Summary
  • Questions and answers

Scalability in the Cloud
  • Scalability == work / resources
  • Windows Azure makes adding AND REMOVING resources dynamic
  • This, along with the business model, changes things
      • Capacity planning becomes dynamic
      • Utilisation levels are important
  • Definition of scale is different depending on application type and workload arrival characteristics

Scaling Facebook Apps in the Azure Cloud
Jim Zimmerman
CTO / Lead Developer
Thuzi.com (partner)

Who is Thuzi?
  • We develop customized viral marketing solutions, utilizing a variety of technologies that engage users and measure results.
  • We ensure maximum scalability by exploiting the latest virtual computing, using Microsoft's Azure Platform and Tools.

Facebook Viral Application Needs
  • Support for thousands of users virtually overnight: our models predicted geometric adoption
  • The success of one of our clients could not be the failure for others: requirement for distinct computing environments for each Thuzi customer
  • Our job is to turn social media data into real business information: must have a robust back end for reporting detailed analytics

Facebook Viral Application Needs (continued)
  • Thuzi builds cool social media web apps, and we don't know much about running data centers; besides, we didn't want to purchase extra servers "just in case"
  • A consistent user experience was mandatory: social media users don't like to wait

Hosting Options
  • Our own data center: too expensive and, with unpredictable growth, hard to plan for
  • Google: didn't have a familiar programming environment
  • Amazon: could use Windows VMs, but did not have as many features as we wanted
  • Azure: familiar Microsoft technologies

Technology

Outback DEMO

The Results

Fan Growth over Time

Lessons Learned
  • Trace everything!
      • Errors, debug info
      • You will upgrade later as you start to ask questions about how your app is behaving
  • Track perf counters
      • CPU usage, req/sec, memory usage
  • Use worker roles to move data from queues to table storage and SQL Azure
      • SQL is easier to report on
      • Table storage allows more scalability
  • Deployment
      • Upgrade manually
      • When moving to production, use the VIP Swap feature

Tracing

    // config is assumed to be the role's DiagnosticMonitorConfiguration instance.
    // Transfer only error-level infrastructure logs to storage, every 5 minutes.
    config.DiagnosticInfrastructureLogs.ScheduledTransferLogLevelFilter = LogLevel.Error;
    config.DiagnosticInfrastructureLogs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

    // Apply the same filter and schedule to the application trace logs.
    config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Error;
    config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);
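
The lesson above about using worker roles to move data from queues into table storage and SQL Azure can be sketched as follows. This is a minimal illustration in Python rather than the deck's C#, and everything in it is an assumption made for the example: the standard-library queue stands in for an Azure queue, a dict keyed by (partition key, row key) stands in for table storage, an in-memory SQLite database stands in for SQL Azure, and the names drain_queue and demo and the event fields are hypothetical.

```python
import json
import queue
import sqlite3


def drain_queue(work_queue, table_store, sql_conn, max_messages=32):
    """Move up to max_messages events from the queue into both stores."""
    moved = 0
    while moved < max_messages:
        try:
            raw = work_queue.get_nowait()
        except queue.Empty:
            break  # nothing left to move on this pass
        event = json.loads(raw)
        # Table-storage stand-in: schemaless entity keyed by (partition, row).
        table_store[(event["app"], event["id"])] = event
        # SQL stand-in: a flat row that is easy to aggregate in reports.
        sql_conn.execute(
            "INSERT INTO events (id, app, action) VALUES (?, ?, ?)",
            (event["id"], event["app"], event["action"]),
        )
        moved += 1
    sql_conn.commit()
    return moved


def demo():
    """Queue three hypothetical events, drain them, and run a small report."""
    work_queue = queue.Queue()
    for i, action in enumerate(["install", "share", "install"]):
        message = {"id": str(i), "app": "demo_app", "action": action}
        work_queue.put(json.dumps(message))

    table_store = {}
    sql_conn = sqlite3.connect(":memory:")
    sql_conn.execute(
        "CREATE TABLE events (id TEXT PRIMARY KEY, app TEXT, action TEXT)"
    )
    moved = drain_queue(work_queue, table_store, sql_conn)
    report = sql_conn.execute(
        "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY action"
    ).fetchall()
    return moved, len(table_store), report
```

Writing each event to both stores mirrors the trade-off stated on the slide: the keyed store is the one that scales out, while the SQL copy is the one that answers reporting queries with a single GROUP BY.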

Performance Monitoring

    // Sample total CPU usage every 5 seconds.
    var cpuUsage = new PerformanceCounterConfiguration();
    cpuUsage.CounterSpecifier = @"\Processor(_Total)\% Processor Time";
    cpuUsage.SampleRate = TimeSpan.FromSeconds(5);

    // Sample available memory every 5 seconds.
    var pccMemory = new PerformanceCounterConfiguration();
    pccMemory.CounterSpecifier = @"\Memory\Available MBytes";
    pccMemory.SampleRate = TimeSpan.FromSeconds(5);

    // Sample ASP.NET requests per second every 5 seconds.
    var requestsPerSec = new PerformanceCounterConfiguration();
    requestsPerSec.CounterSpecifier = @"\ASP.NET Applications(__Total__)\Requests/Sec";
    requestsPerSec.SampleRate = TimeSpan.FromSeconds(5);

    // Register the counters and transfer the samples to storage every 5 minutes.
    config.PerformanceCounters.DataSources.Add(cpuUsage);
    config.PerformanceCounters.DataSources.Add(pccMemory);
    config.PerformanceCounters.DataSources.Add(requestsPerSec);
    config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

Deployment
  • Upload new package to staging
  • Wait for all roles to be ready
  • Use VIP Swap to upgrade and deploy to production
      • Rewrites the load balancer to swap staging with production
      • If anything is wrong, you can swap back

Tools Needed
  • Needed to be able to manage records in table storage for testing
  • Needed to be able to download logs from table storage for tracing and perf counters
      • Azure Storage Explorer (CodePlex): free
      • Cloud Storage Studio: paid
  • Run LINQ queries against table storage to get specific info when needed

In Summary
  • Azure provides Thuzi a competitive advantage, so please don't tell the other social media marketing companies, and let us enjoy our 15-minute advantage

Building Scalable Applications using Windows Azure: RiskMetrics RiskBurst
Rob Fraser and Phil Jacob
RiskMetrics Group
www.riskmetrics.com (partner)

RiskMetrics RiskBurst
  • RiskMetrics Group
      • Offers industry-leading products and services in the disciplines of risk management, corporate governance and financial research & analysis
  • Scaling on-premise computation to the cloud
      • Integration of RiskMetrics' extensive on-premise capability with Windows Azure
      • We are running on 2,000 instances on Windows Azure
      • We have plans to use 10,000+ instances in 2010

What are RiskMetrics doing with so much computing power?
  • Calculation of financial risk
      • Simulate scenarios for the movement of market factors over time & price financial assets in those scenarios
      • Notoriously complex: can involve Monte Carlo² for complex asset classes of the kind that triggered the 'credit crunch'
  • Results in very high computational loads for RiskMetrics
      • Daily risk analysis load equivalent to calculating risk on 4 trillion US stocks
      • Computational loads are characterised by high demand peaks
      • Strong growth trend in calculation complexity

Peak Load Characteristics

Growth trend in calculation complexity

[Figure: maximum complexity of a risk analysis processing request over time, in relative equity equivalent units (log scale)]
  • Risk problem complexity has doubled every 6 months
  • Moore's Law: processor power doubles every 2 years

Analytics Architecture: Large-Scale Data Dependent Processing vs. Distributable Work Packets
[Diagram labels: Load Balancer; Market and Pricing Data; Velocity Scenario Cache; RiskServer (x5); Pricer (x6)]
  • Scenario Generation and Aggregation: these services are dependent on high-speed access to large-scale data stores and caches
  • Scenario Pricing: work packets are self-contained

Work Packet Example
  • Pricing request for a Mortgage Backed Security
  • Compute time: 150 ms to 30 s

Analytics Architecture: Integration of Cloud Resources?
[Diagram labels: as above, with a second group of six Pricer instances]
  • Scenario Generation and Aggregation: these services are dependent on high-speed access to large-scale data stores and caches
  • Scenario Pricing: work packets are self-contained

RiskBurst Project Timeline

RiskBurst
An architectural pattern for large scale computational applications

Architectural Pattern
  • Building large scale computation requires careful design
  • Problem: need to avoid the Von Neumann bottleneck
  • Keywords: Reason and Instrument
  • No changes to the application
  • Run on-premise on HPC Server or in cloud on Azure
  • Pattern has end-to-end decoupling
  • Horizontal scaling of decoupled components

[Diagram labels: Workload Generation; Messaging & Storage; Computational Resources & Application (each repeated across the diagram)]

RiskBurst Workflow: Windows Azure & HPC Server
[Diagram labels: RiskBurst Server; Workload Receiver; Batching and Sending; Outstanding Request Timeout Sweeper; Worker Output Monitoring; Scenario Generator; Windows Azure Input Queue(s); Windows Azure Output Queue(s); WCF Request; WCF Response; WCF Error Response; Input Message; Output Message]

Windows Azure Storage Component Usage
[Diagram labels: RiskBurst Server; Azure Queues; Input Queues (To Do Jobs); Input Blob Storage; Worker Role Instances; Local storage; Data; Support files in Blob Storage; Output Queues (Job done); Output Blob Storage]

Mapping to the Azure Environment
  • Visual Studio 2008 Azure development SDK mimics cloud
      • Mix code running in dev locally with cloud resources such as Blob storage or queues
      • Good for features, does not assist with scale
  • Existing 32-bit .NET C++/CLI application with 3 third-party libraries
      • Initial idea: run directly in web role, but 32-bit(!)
      • Run within worker role
      • Preserve WCF interface: no changes whatsoever to analytics app
  • Only changes to existing code base are:
      • Retrieve cash-flow library support files from Blob storage on demand
      • Some diagnostic information added

Getting to Cloud Resources: Bandwidth & Latency
  • Problem: bandwidth to Azure gateway limited by Internet
  • Solution: pass by reference & blobs
      • Replace pass-by-value calls with pass-by-reference
      • Create key for scenario
      • Large, repeated objects (scenarios) pushed to blob storage
      • WCF call contains only key
      • Each of 1000 scenarios, used for all assets
  • Problem: communications latency
      • Within data centre, 20 ms latency on WCF call through HPC SOA platform
      • Queues and Blob storage are off-device; engineering must respect this!
      • Work packet: 200 ms computation
  • Solution: batch requests within input queues
      • But, more simultaneous work requests (threads outstanding on input)
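
The pass-by-reference idea above (push each large scenario to blob storage once and send only its key in the call) can be sketched as follows. This is a minimal Python illustration, not the RiskBurst implementation: BlobStore is a hypothetical in-memory stand-in rather than the Azure blob API, deriving the key from a content hash is an assumption (the slide only says "create key for scenario"), and the summing "pricer" is a toy.

```python
import hashlib
import json


class BlobStore:
    """In-memory stand-in for blob storage (hypothetical, not the Azure API)."""

    def __init__(self):
        self._blobs = {}
        self.uploads = 0  # counts how many payloads actually crossed the wire

    def put_if_absent(self, key, data):
        if key not in self._blobs:
            self._blobs[key] = data
            self.uploads += 1

    def get(self, key):
        return self._blobs[key]


def scenario_key(scenario):
    """Derive a stable key from the scenario content (an assumed scheme)."""
    payload = json.dumps(scenario, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_request(blobs, asset_id, scenario):
    """Client side: upload the scenario once, send only its key."""
    key = scenario_key(scenario)
    blobs.put_if_absent(key, json.dumps(scenario, sort_keys=True))
    return {"asset": asset_id, "scenario_key": key}


def handle_request(blobs, request, cache):
    """Worker side: resolve the key, caching the scenario on the node."""
    key = request["scenario_key"]
    if key not in cache:
        cache[key] = json.loads(blobs.get(key))
    scenario = cache[key]
    # Toy "pricing": a real pricer would value the asset under the scenario.
    return request["asset"], sum(scenario["rates"])
```

With one scenario shared by many assets, the large payload is uploaded once and every request stays a few dozen bytes, which is the effect the slide is after when bandwidth to the gateway is the constraint.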

Utilizing Cloud Resources: Generating Load
  • Problem: generating load for cloud resources
  • Threading architecture
      • Workload originally generated by synchronous calls in client
      • Number of outstanding pricing requests = nodes x batch size
      • Implies large number of threads in wait states in scenario generators
      • Work request made asynchronous
  • RiskBurst Server logic
      • Creates a balanced workload: uses a work item's average run time
      • Made calls to RiskBurst Server asynchronous
      • Incoming calls create batch entry synchronously with request
      • Map created from message id to wait handlers
      • When batch full, sent on to Azure input queue
      • Sweeper thread gathers up output messages and uses map to associate with wait handlers
      • Scales well to over 1000 simultaneous requests per RiskBurst Server
  • Horizontal scale of RiskBurst Servers: each creates own input queue

Horizontal Scaling within the Cloud
  • Problem: saturation behaviour of queues
      • Can create situation where queues are saturated, made worse by retry logic
      • Complexity due to varied processing time
      • Controller will move busy queues to independent hardware
  • Use exponential back-off algorithm
  • Batch work items for each queue read or write (using 10 work packets per queue item)
      • Amortizing the cost of IO against CPU time is key
      • Batch compute sizes need to be big enough both to occupy the CPU for long enough and not cause the swamping of the queues
      • Also, more items contained in queue item -> fewer queue hits
      • But, larger batches imply more simultaneous outstanding connections on client side
      • Variable run-time of assets, from 150 ms to 30 seconds
  • Carry out processing concurrently with queue access
      • Pushing IO onto background threads is critical (the writes and the deletes are independent background tasks)
      • On-node caching within worker role to avoid queue reads
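
The two queue recommendations above (batch 10 work packets per queue item, and back off exponentially when polling) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the standard-library queue stands in for an Azure input queue, the names make_batches, backoff_delays and drain are hypothetical, and the base delay, factor, cap and the max_empty_polls stop condition are illustrative values chosen so the example terminates (a real worker would keep polling).

```python
import queue


def make_batches(packets, batch_size=10):
    """Group work packets so that one queue item carries batch_size packets."""
    return [packets[i:i + batch_size] for i in range(0, len(packets), batch_size)]


def backoff_delays(base=0.1, factor=2.0, cap=5.0):
    """Yield base, base*factor, base*factor**2, ... never exceeding cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def drain(input_queue, process, sleep, max_empty_polls=4):
    """Process batched queue items, backing off while the queue is empty."""
    results = []
    delays = backoff_delays()
    empty_polls = 0
    while empty_polls < max_empty_polls:
        try:
            batch = input_queue.get_nowait()
        except queue.Empty:
            # Each consecutive empty read waits longer, easing queue pressure.
            empty_polls += 1
            sleep(next(delays))
            continue
        # Work arrived: reset the back-off so the next wait starts short.
        empty_polls = 0
        delays = backoff_delays()
        results.extend(process(packet) for packet in batch)
    return results
```

Passing sleep in as a parameter (time.sleep in a real worker) keeps the loop testable without waiting. The batch size is the knob the slide describes: larger batches mean fewer queue hits per unit of CPU work, at the price of more requests outstanding on the client side.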

Exception Management in Distributed Applications
  • Keep it simple
      • Large distributed system implies need to engineer robustness to failure
      • Distinguish between events that are random and unpredictable and poison-message kind of failures
      • Do not over-engineer efficient handling of occasional exceptions
  • Return exceptions to client application
      • Client can track number of attempts to process a work item
      • Distinguish poison messages and give up
      • Parallel handling on HPC Server SOA platform
  • Complexity from varying message processing times
      • Time-outs can be caused by several long-running pricings in same job
      • Re-try time-outs by sending all pricings in batch independently

Diagnostics and Run-time Monitoring
  • A challenge for large scale applications, even more so for cloud
  • Logging and monitoring must be switchable so as to reduce overhead
      • Variable level of diagnostics and logging
      • Requirement to filter information through decoupled architecture (on node; centralized in Azure; returned to client)
  • Key data for architectural pattern
      • Request and result queue: successful/unsuccessful read, write and delete; time taken for all operations
      • Empty request queue gets
      • Count of successful/unsuccessful work packets
      • % Processor Time performance counter
      • Cache misses
  • We utilized custom built solution during TAP
      • Nodes broadcast over service bus
      • Clients subscribe to trace mes...
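
The exception-management guidance above (return exceptions to the client, let the client count attempts per work item, and give up on poison messages) can be sketched as follows. This is a minimal Python illustration, not the RiskBurst code: the function name run_with_retries and the limit of three attempts are assumptions, and work items are assumed to be hashable identifiers.

```python
def run_with_retries(work_items, process, max_attempts=3):
    """Retry failed items up to max_attempts, then treat them as poison."""
    attempts = {item: 0 for item in work_items}
    results = {}
    poison = {}
    pending = list(work_items)
    while pending:
        item = pending.pop(0)
        attempts[item] += 1
        try:
            results[item] = process(item)
        except Exception as exc:
            if attempts[item] < max_attempts:
                # Possibly a random, transient failure: queue it again.
                pending.append(item)
            else:
                # Failed every time: classify as a poison message and give up.
                poison[item] = str(exc)
    return results, poison, attempts
```

Keeping the policy this simple follows the "do not over-engineer" advice: the client needs only an attempt counter to separate occasional random failures from items that can never succeed.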
