Windows Azure from the Pulpit to the Whiteboard
Ryan Dunn & Wade Wegner
WAD-B351
What we will cover today
Lessons Learned (the hard way)
Building for Scale
Automation, Testing
Deployment Patterns in Windows Azure
Handling Disaster/Downtime
ACCELERATE INNOVATIONS USING CLOUD
DIFFERENTIATE WITH DESIGN AND USER EXPERIENCE
DELIVER SCALE AND AGILITY
TO THE CLOUD. THE RIGHT WAY.
What we do at Aditi
Our clients are technology leaders …
AzureOps.com
Monitors deployments in Windows Azure
At peak, monitored ~3,000 VMs in 6 datacenters
Processed TBs of trace data per month and GBs of perf counters
Consumed half a billion storage transactions per month
Ran on 2 S, 4 M + (2 M x DC), and 12 XS instances
Auto-scales based on custom metrics
Alerts based on custom rules
Scheduler
Time-based job scheduling
Features: webhooks (GET, POST), Windows Azure Queues
Auth: Basic or none
Available via NuGet
500,000 job executions
Live API documentation
Four plans available in store
Aditi’s High-Level Architecture (CQRS)
[Architecture diagram: the Service Client sends Commands to the Web Scheduler and Queries to the Query Svc. Command Handlers apply commands to the Domain Model, which raises Events onto the Event Bus. Event Handlers and Denormalizers consume those events; Denormalizers write View Data through the Data Access Layer, which also persists the Event Data.]
Why did we choose this architecture?
• Allowed us to easily scale our backend (async) while keeping the front end very responsive
• Compartmentalized logic into handlers that could be independently developed and tested
• Event sourcing gave us not only the what, but the how; audit history came along for the ride
• Flexibility to add or modify views at any time and regenerate them
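The command/event/denormalizer flow described above can be sketched as follows. This is a minimal illustration of the CQRS-with-event-sourcing pattern, not the actual AzureOps or Scheduler code; the `CreateJob`/`JobCreated` names and the in-memory stores are hypothetical.

```python
# Minimal CQRS/event-sourcing sketch. Commands express intent, events record
# facts, and denormalizers build read-optimized views from the event stream.
from dataclasses import dataclass

@dataclass
class CreateJob:            # a command: an intent to change state (hypothetical)
    job_id: str
    name: str

@dataclass
class JobCreated:           # an event: a fact that the change happened
    job_id: str
    name: str

class EventBus:
    def __init__(self):
        self.handlers = {}                  # event type -> list of handlers
    def subscribe(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)
    def publish(self, event):
        for handler in self.handlers.get(type(event), []):
            handler(event)

event_store = []            # append-only log: gives you "what" AND "how" (audit)
job_list_view = {}          # denormalized read model, served to the UI

def handle_create_job(cmd, bus):
    event = JobCreated(cmd.job_id, cmd.name)
    event_store.append(event)               # event sourcing: persist the event itself
    bus.publish(event)                      # denormalizers update views from the bus

def job_list_denormalizer(event):
    job_list_view[event.job_id] = {"name": event.name}

bus = EventBus()
bus.subscribe(JobCreated, job_list_denormalizer)
handle_create_job(CreateJob("j1", "nightly-backup"), bus)
```

Because views are regenerated from events, adding a new view later is just a new denormalizer replayed over the stored event log.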
Example
Customer with > 300 VMs deployed and hundreds of SQL Azure databases.
An error in the DB connection logic caused a tight retry loop, and each error was traced with a full stack trace: over 2 GB of trace data per minute was being generated.
Table storage's data format is verbose to begin with, but:
A 1 Gb NIC was completely saturated on 16 workers trying to keep up
Reads timed out due to too much data, which caused a RETRY
Autoscale noticed queue levels stacking up and scaled to max
Solution
Throttle! Protect your service.
Read only the first 50,000 traces in 5 minutes and raise a Throttled event
Be very cautious with retry policies; vicious cycles can bite
Protect your services
Always assume the worst will happen. It will, guaranteed. Users will find ways to crash your services, guaranteed.
Plan to throttle: timeouts, max result size, # of queries per minute
Be wary of retries
Automatic retries rarely work the way you think they do
Retry policies can lead to even bigger failures
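The retry advice above can be made concrete with a bounded retry: cap the attempts and back off between them, so a failing call cannot become the tight loop from the example. This is a generic sketch, not any specific Azure SDK retry policy.

```python
# Bounded retry with exponential backoff. An unbounded tight-loop retry is
# exactly what saturated the NIC in the example: every failure re-issues the
# same oversized read, times out, and retries again. Capping attempts and
# backing off breaks that cycle and lets throttling/alerting take over.
import time

def with_retries(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise                           # give up; surface the failure
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Passing `sleep` as a parameter also makes the policy testable without real delays.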
Learn from your mistakes
Example
The autoscale system would periodically kick up 4 new instances based on queue length.
Queues would bleed down, and the pattern repeated every 2 hours (as 2 instances scaled away).
It was unknown why the command queue length was increasing on a cyclical 2-hour cycle.
Solution
Added custom counters by type of command.
Found that calls to SMAPI averaged 60 seconds and took as long as 240 seconds (the fastest was 30).
Routed SMAPI refresh commands to their own queue and put 12 XS workers on it to process:
Same cost as 2 S workers
6x as many commands processed
Example
A customer would update settings and the UI would not reflect the change. Eventually the change would appear, but history would often show multiple updates of exactly the same data. Customers contacted support to complain.
Solution
Found that commands from customers were getting routed to the same processing queue as other, much longer-running commands.
Re-prioritized commands coming from the UI onto their own queue with dedicated resources (VIP queues).
Push your hard work to queues
Applies almost universally in the CQRS world
Allows for asynchronous dispatch and eventual handling
Prioritize queues
E.g. VIP queues for UI commands
Prevents stacking of high-priority messages that update views
Alleviates the front end from blocking calls
HTTP requests become highly efficient and scalable when combined with read optimization
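The VIP-queue idea is simple routing at dispatch time: commands that originate from the UI go to a dedicated queue with its own workers, so long-running background commands cannot starve them. A sketch under assumed names (`vip`/`background` queues and the `is_from_ui` flag are illustrative, not a real API):

```python
# Route commands to a VIP queue or a background queue based on origin.
# In production these would be separate Windows Azure Queues, each drained
# by its own pool of workers; plain deques stand in for them here.
from collections import deque

queues = {"vip": deque(), "background": deque()}

def route(command):
    # UI-originated commands get dedicated resources; everything else waits
    # behind the slow work without affecting user-visible latency.
    name = "vip" if command.get("is_from_ui") else "background"
    queues[name].append(command)

route({"type": "UpdateSettings", "is_from_ui": True})
route({"type": "SmapiRefresh"})     # slow command: background queue
```

The same `route` function is also where geo-specific routing or quarantining can hook in later.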
Snapshot 1
Snapshot 2
Solution
Found that deserialization of the event store grew almost exponentially with the size and number of events. Fixed with an implementation of snapshotting.
Snapshot 3
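Snapshotting means persisting the aggregate's state every N events so that loading replays only the tail of the stream instead of every event since the beginning. A minimal sketch, assuming a toy counter aggregate and an arbitrary snapshot interval:

```python
# Event-store snapshotting sketch: replay from the latest snapshot, not from
# event zero. The counter aggregate and the interval of 100 are illustrative.
SNAPSHOT_EVERY = 100
events = [1] * 250          # stand-in event stream: each event adds 1 to state
snapshots = {}              # number of events applied -> accumulated state

# Write side: record a snapshot as every Nth event is appended.
state = 0
for i, e in enumerate(events):
    state += e
    if (i + 1) % SNAPSHOT_EVERY == 0:
        snapshots[i + 1] = state

def load_aggregate():
    """Rehydrate the aggregate from the latest snapshot plus the event tail."""
    if snapshots:
        start = max(snapshots)          # most recent snapshot position
        state = snapshots[start]
    else:
        start, state = 0, 0
    for e in events[start:]:            # deserialize only the events since then
        state += e
    return state
```

With 250 events and snapshots at 100 and 200, loading deserializes only 50 events instead of 250; the saving grows with stream length, which is what tamed the near-exponential load cost.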
Example
Queue length in North Europe was consistently longer than in North America, despite having fewer tenants (less work to do). The autoscaler was frequently and aggressively scaling up to handle the queue. But processing time for aggregation was the same as or shorter than in other geographies.
Solution
One North Europe tenant had used the same storage account for both a load-testing dev/test environment and production.
It was the equivalent of finding a needle in a haystack.
Tuned our timeouts for aggregation scheduling (before aggregation).
Identifying Bottlenecks
Instrumentation is key
Custom performance counters are nice
A simple timer that traces long commands works well too
Watch for trends over time; snapshots might not help until you look at them over time
Profile your code when data indicates a problem
Tune, and then verify changes
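The "simple timer that traces long commands" can be a context manager wrapped around each command handler. A sketch; the threshold and the reporting callback are placeholders for whatever tracing your service uses.

```python
# Trace only commands that exceed a threshold, so the trace stream stays
# small (remember the 2 GB/minute example) while slow commands stand out.
import logging
import time
from contextlib import contextmanager

@contextmanager
def trace_if_slow(command_name, threshold_secs=1.0, report=logging.warning):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if elapsed >= threshold_secs:
            report("%s took %.2fs", command_name, elapsed)

# Usage: wrap a command handler; nothing is traced unless it runs long.
with trace_if_slow("SmapiRefresh", threshold_secs=0.0):
    pass  # handle the command here
```

Feeding these timings into a counter per command type is exactly how the SMAPI outlier above was found.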
Optimize for reads
[Diagram: a RegisterUser command goes to the UserRegistrationHandler, which raises a UserRegistered event onto the Event Bus. The UserProfileHandler and UserQuotaHandler each consume the event and write the User Profile and Quota View read models to Storage.]
Pre-calculate and store each view that will be displayed to your users. It’s OK if data is slightly stale. Really.
Storage is cheap
Create a new view for each ‘task’ on the UI
It’s OK to have the same data in multiple views
Don’t make your web servers work hard
Serve JSON files directly from storage
Cache and use ETags
Light transformations on dynamic data are OK
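The "cache and use ETags" point works like this: hash the pre-calculated JSON view, hand the hash to the client as an ETag, and answer 304 with no body when the client's `If-None-Match` still matches. Blob storage generates ETags for you; this sketch just shows the mechanics with made-up view data.

```python
# ETag sketch for serving a pre-calculated JSON view. A matching
# If-None-Match means the client's cached copy is current: no body sent.
import hashlib
import json

def render_view(view_data, if_none_match=None):
    body = json.dumps(view_data, sort_keys=True)
    etag = '"%s"' % hashlib.md5(body.encode()).hexdigest()
    if if_none_match == etag:
        return 304, etag, None          # Not Modified: client reuses its cache
    return 200, etag, body

status, etag, body = render_view({"balance": 42})
status2, _, body2 = render_view({"balance": 42}, if_none_match=etag)
```

Since the views are static files regenerated by denormalizers, the ETag only changes when an event actually changed the view, so most reads cost no bandwidth at all.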
Learn to live with stale data
Pre-calculated views are by definition already stale; in CQRS, events raised from commands are dispatched to denormalizers from the event bus
Create a view for each ‘task’ on the UI: static views versus dynamic views
Dynamic view == queryable data, e.g. the last X hours of trace information at level Y; served efficiently from table storage
Static data, e.g. account details, current balance, or settings; ideal for JSON files sitting in blob storage
Work close to your data
Bandwidth becomes the limiting factor:
Between datacenters
Between the VM and the NIC
Cost goes down, performance goes up. Win, win!
Employ a message routing strategy
Messages can be routed to geo-specific queues
Messages can be prioritized
Messages can be quarantined
Building for Scale RecapOptimize for reads
Learn to live with stale data
Work close to your data
Protect your services
Push your hard work to queues
Instrument
Why automate?
Reproducibility, reproducibility.
It takes the ‘I forgot to…’ out of it.
Automate
Data migrations tie it together.
Continuous builds raise the quality bar.
Visual Studio deploys are verboten.
Build Demo
Automation RecapBuild Automation
Deployment Automation
Data Migrations
PaaS vs IaaS
Large scale (> 50 instances) requires PaaS today
PaaS is a much easier deployment, upgrade, and maintenance model
It requires architecting differently: no state, idempotent operations, etc.
IaaS is wonderful for stateful apps
PaaS FTW
Extra Small VMs
A hidden gem among the instance sizes, with a 4x cost advantage
If you can live within the bandwidth and memory constraints, it's a big bang for the buck
Websites versus WebRoles
Git deployment on Windows Azure Websites is very nice
The ability to use a RoleEntryPoint on WebRoles can be more important than ease of deployment:
Ability to control dependencies
Scale beyond Websites and VMs
SSL support is free
WebRoles FTW
Single vs Many Deployments
Many deployments are required for geo-redundancy, but coordinating upgrades becomes challenging.
The ability to dynamically route messages works to your advantage:
Just ‘turn off’ a geo-route until the upgrade completes
If a datacenter is having ‘issues’, you can remove it from routing
Deployment PatternsPaaS vs IaaS
Extra Small VMs
Azure Websites vs WebRoles
Single Deployment vs many
Handling outages and disasters
Outages are extremely common (trust me), as are service degradations.
Every service you use has its own SLA.
The best you can do is the product of each SLA, e.g. 99.95% x 99.9% x 99.9% = 99.75%.
Your service will go down.
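The SLA arithmetic above generalizes: a request that depends on several services in series can do no better than the product of their availabilities, which is always worse than the weakest individual SLA. A quick check:

```python
# Composite availability of serially-dependent services: multiply the
# individual SLAs (as fractions), since all must be up for a request to succeed.
def composite_sla(*slas_percent):
    result = 1.0
    for s in slas_percent:
        result *= s / 100.0
    return result * 100.0

# The example from the slide: 99.95% compute x 99.9% storage x 99.9% SQL.
availability = composite_sla(99.95, 99.9, 99.9)   # roughly 99.75%
```

At 99.75% availability that is over 21 hours of allowable downtime per year, which is why remediation planning matters even when every dependency individually looks solid.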
Remediation Strategies
Weigh your risk: tradeoffs abound; what is your single point of failure?
Queues can help; distributed queues might be necessary
Geo-distribution: fault domains, multi-datacenter, multi-cloud
RecapBuilding for Scale
Automation and Testing
Deployment Patterns
Disaster and Recovery
© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.