Windows Azure from the Pulpit to the Whiteboard
Ryan Dunn & Wade Wegner
WAD-B351
What we will cover today
Lessons Learned (the hard way)
Building for Scale
Automation, Testing
Deployment Patterns in Windows Azure
Handling Disaster/Downtime
ACCELERATE INNOVATIONS USING CLOUD
DIFFERENTIATE WITH DESIGN AND USER EXPERIENCE
DELIVER SCALE AND AGILITY
TO THE CLOUD. THE RIGHT WAY.
What we do at Aditi
Our clients are technology leaders …
AzureOps.com
Monitors deployments in Windows Azure
At peak, monitored ~3,000 VMs in 6 datacenters
Processed TBs of trace data per month and GBs of perf counters
Consumed half a billion storage transactions per month
Ran on 2 S, 4 M + (2 M x DC), and 12 XS instances
Auto-scales based on custom metrics
Alerts based on custom rules
Scheduler
Time-based job scheduling
Features: webhooks (GET, POST), Windows Azure Queues
Auth: Basic or none
Available via NuGet
500,000 job executions
Live API documentation
Four plans available in store
Aditi’s High-Level Architecture (CQRS)
[Architecture diagram: the Service Client sends Commands to the Web Scheduler and Queries to the Query Svc. Command Handlers apply commands to the Domain Model, which raises Events onto the Event Bus. Event Handlers and Denormalizers consume those events; Denormalizers write View Data through the Data Access Layer, which also persists the Event Data.]
Why did we choose this architecture?
• Allowed us to easily scale our backend (async) while keeping the front end very responsive
• Compartmentalized logic into handlers that could be independently developed and tested
• Event sourcing gave us not only the what, but the how; audit history came along for the ride
• Flexibility to add or modify views at any time and regenerate them
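The command/event/denormalizer flow described above can be sketched as follows. This is a minimal illustration of the CQRS-with-event-sourcing pattern, not the actual AzureOps or Scheduler code; the `CreateJob`/`JobCreated` names and the in-memory stores are hypothetical.

```python
# Minimal CQRS/event-sourcing sketch. Commands express intent, events record
# facts, and denormalizers build read-optimized views from the event stream.
from dataclasses import dataclass

@dataclass
class CreateJob:            # a command: an intent to change state (hypothetical)
    job_id: str
    name: str

@dataclass
class JobCreated:           # an event: a fact that the change happened
    job_id: str
    name: str

class EventBus:
    def __init__(self):
        self.handlers = {}                  # event type -> list of handlers
    def subscribe(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)
    def publish(self, event):
        for handler in self.handlers.get(type(event), []):
            handler(event)

event_store = []            # append-only log: gives you "what" AND "how" (audit)
job_list_view = {}          # denormalized read model, served to the UI

def handle_create_job(cmd, bus):
    event = JobCreated(cmd.job_id, cmd.name)
    event_store.append(event)               # event sourcing: persist the event itself
    bus.publish(event)                      # denormalizers update views from the bus

def job_list_denormalizer(event):
    job_list_view[event.job_id] = {"name": event.name}

bus = EventBus()
bus.subscribe(JobCreated, job_list_denormalizer)
handle_create_job(CreateJob("j1", "nightly-backup"), bus)
```

Because views are regenerated from events, adding a new view later is just a new denormalizer replayed over the stored event log.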
Example
Customer with > 300 VMs deployed and hundreds of SQL Azure databases.
An error in the DB connection logic caused a tight retry loop, and each error was traced with a full stack trace: over 2 GB of trace data per minute was being generated.
Table storage's data format is verbose to begin with, but:
A 1 Gb NIC was completely saturated on 16 workers trying to keep up
Reads timed out due to too much data, which caused a RETRY
Autoscale noticed queue levels stacking up and scaled to max
Solution
Throttle! Protect your service.
Read only the first 50,000 traces in 5 minutes and raise a Throttled event
Be very cautious with retry policies; vicious cycles can bite
Protect your services
Always assume the worst will happen. It will, guaranteed. Users will find ways to crash your services, guaranteed.
Plan to throttle: timeouts, max result size, # of queries per minute
Be wary of retries
Automatic retries rarely work the way you think they do
Retry policies can lead to even bigger failures
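The retry advice above can be made concrete with a bounded retry: cap the attempts and back off between them, so a failing call cannot become the tight loop from the example. This is a generic sketch, not any specific Azure SDK retry policy.

```python
# Bounded retry with exponential backoff. An unbounded tight-loop retry is
# exactly what saturated the NIC in the example: every failure re-issues the
# same oversized read, times out, and retries again. Capping attempts and
# backing off breaks that cycle and lets throttling/alerting take over.
import time

def with_retries(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise                           # give up; surface the failure
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Passing `sleep` as a parameter also makes the policy testable without real delays.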
Learn from your mistakes
Example
The autoscale system would periodically kick up 4 new instances based on queue length.
Queues would bleed down, and the pattern repeated every 2 hours (as 2 instances scaled away).
It was unknown why the command queue length was increasing on a cyclical 2-hour cycle.
Solution
Added custom counters by type of command.
Found that calls to SMAPI averaged 60 seconds and took as long as 240 seconds (the fastest was 30).
Routed SMAPI refresh commands to their own queue and put 12 XS workers on it to process:
Same cost as 2 S workers
6x as many commands processed
Example
A customer would update settings and the UI would not reflect the change. Eventually the change would appear, but history would often show multiple updates of exactly the same data. Customers contacted support to complain.
Solution
Found that commands from customers were getting routed to the same processing queue as other, much longer-running commands.
Re-prioritized commands coming from the UI onto their own queue with dedicated resources (VIP queues).
Push your hard work to queues
Applies almost universally in the CQRS world
Allows for asynchronous dispatch and eventual handling
Prioritize queues
E.g. VIP queues for UI commands
Prevents stacking of high-priority messages that update views
Alleviates the front end from blocking calls
HTTP requests become highly efficient and scalable when combined with read optimization
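The VIP-queue idea is simple routing at dispatch time: commands that originate from the UI go to a dedicated queue with its own workers, so long-running background commands cannot starve them. A sketch under assumed names (`vip`/`background` queues and the `is_from_ui` flag are illustrative, not a real API):

```python
# Route commands to a VIP queue or a background queue based on origin.
# In production these would be separate Windows Azure Queues, each drained
# by its own pool of workers; plain deques stand in for them here.
from collections import deque

queues = {"vip": deque(), "background": deque()}

def route(command):
    # UI-originated commands get dedicated resources; everything else waits
    # behind the slow work without affecting user-visible latency.
    name = "vip" if command.get("is_from_ui") else "background"
    queues[name].append(command)

route({"type": "UpdateSettings", "is_from_ui": True})
route({"type": "SmapiRefresh"})     # slow command: background queue
```

The same `route` function is also where geo-specific routing or quarantining can hook in later.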
Snapshot 1
Snapshot 2
Solution
Found that deserialization of the event store grew almost exponentially with the size and number of events. Fixed with an implementation of snapshotting.
Snapshot 3
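Snapshotting means persisting the aggregate's state every N events so that loading replays only the tail of the stream instead of every event since the beginning. A minimal sketch, assuming a toy counter aggregate and an arbitrary snapshot interval:

```python
# Event-store snapshotting sketch: replay from the latest snapshot, not from
# event zero. The counter aggregate and the interval of 100 are illustrative.
SNAPSHOT_EVERY = 100
events = [1] * 250          # stand-in event stream: each event adds 1 to state
snapshots = {}              # number of events applied -> accumulated state

# Write side: record a snapshot as every Nth event is appended.
state = 0
for i, e in enumerate(events):
    state += e
    if (i + 1) % SNAPSHOT_EVERY == 0:
        snapshots[i + 1] = state

def load_aggregate():
    """Rehydrate the aggregate from the latest snapshot plus the event tail."""
    if snapshots:
        start = max(snapshots)          # most recent snapshot position
        state = snapshots[start]
    else:
        start, state = 0, 0
    for e in events[start:]:            # deserialize only the events since then
        state += e
    return state
```

With 250 events and snapshots at 100 and 200, loading deserializes only 50 events instead of 250; the saving grows with stream length, which is what tamed the near-exponential load cost.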
Example
Queue length in North Europe was consistently longer than in North America, despite having fewer tenants (less work to do). The autoscaler was frequently and aggressively scaling up to handle the queue. But processing time for aggregation was the same as or shorter than in other geographies.
Solution
One North Europe tenant had used the same storage account for both a load-testing dev/test environment and production.
It was the equivalent of finding a needle in a haystack.
Tuned our timeouts for aggregation scheduling (before aggregation).
Identifying Bottlenecks
Instrumentation is key
Custom performance counters are nice
A simple timer that traces long commands works well too
Watch for trends over time; snapshots might not help until you look at them over time
Profile your code when data indicates a problem
Tune, and then verify changes
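The "simple timer that traces long commands" can be a context manager wrapped around each command handler. A sketch; the threshold and the reporting callback are placeholders for whatever tracing your service uses.

```python
# Trace only commands that exceed a threshold, so the trace stream stays
# small (remember the 2 GB/minute example) while slow commands stand out.
import logging
import time
from contextlib import contextmanager

@contextmanager
def trace_if_slow(command_name, threshold_secs=1.0, report=logging.warning):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if elapsed >= threshold_secs:
            report("%s took %.2fs", command_name, elapsed)

# Usage: wrap a command handler; nothing is traced unless it runs long.
with trace_if_slow("SmapiRefresh", threshold_secs=0.0):
    pass  # handle the command here
```

Feeding these timings into a counter per command type is exactly how the SMAPI outlier above was found.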
Optimize for reads
[Diagram: a RegisterUser command goes to the UserRegistrationHandler, which raises a UserRegistered event onto the Event Bus. The UserProfileHandler and UserQuotaHandler each consume the event and write the User Profile and Quota View read models to Storage.]
Pre-calculate and store each view that will be displayed to your users. It’s OK if data is slightly stale. Really.
Storage is cheap
Create a new view for each ‘task’ on the UI
It’s OK to have the same data in multiple views
Don’t make your web servers work hard
Serve JSON files directly from storage
Cache and use ETags
Light transformations on dynamic data are OK
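The "cache and use ETags" point works like this: hash the pre-calculated JSON view, hand the hash to the client as an ETag, and answer 304 with no body when the client's `If-None-Match` still matches. Blob storage generates ETags for you; this sketch just shows the mechanics with made-up view data.

```python
# ETag sketch for serving a pre-calculated JSON view. A matching
# If-None-Match means the client's cached copy is current: no body sent.
import hashlib
import json

def render_view(view_data, if_none_match=None):
    body = json.dumps(view_data, sort_keys=True)
    etag = '"%s"' % hashlib.md5(body.encode()).hexdigest()
    if if_none_match == etag:
        return 304, etag, None          # Not Modified: client reuses its cache
    return 200, etag, body

status, etag, body = render_view({"balance": 42})
status2, _, body2 = render_view({"balance": 42}, if_none_match=etag)
```

Since the views are static files regenerated by denormalizers, the ETag only changes when an event actually changed the view, so most reads cost no bandwidth at all.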
Learn to live with stale data
Pre-calculated views are by definition already stale; in CQRS, events raised from commands are dispatched to denormalizers from the event bus
Create a view for each ‘task’ on the UI: static views versus dynamic views
Dynamic view == queryable data, e.g. the last X hours of trace information at level Y; served efficiently from table storage
Static data, e.g. account details, current balance, or settings; ideal for JSON files sitting in blob storage
Work close to your data
Bandwidth becomes the limiting factor:
Between datacenters
Between the VM and the NIC
Cost goes down, performance goes up. Win, win!
Employ a message routing strategy
Messages can be routed to geo-specific queues
Messages can be prioritized
Messages can be quarantined
Building for Scale RecapOptimize for reads
Learn to live with stale data
Work close to your data
Protect your services
Push your hard work to queues
Instrument
Why automate?
Reproducibility, reproducibility.
It takes the ‘I forgot to…’ out of it.
Automate
Data migrations tie it together.
Continuous builds raise the quality bar.
Visual Studio deploys are verboten.
Build Demo
Automation RecapBuild Automation
Deployment Automation
Data Migrations
PaaS vs IaaS
Large scale (> 50 instances) requires PaaS today
PaaS is a much easier deployment, upgrade, and maintenance model
It requires architecting differently: no state, idempotent operations, etc.
IaaS is wonderful for stateful apps
PaaS FTW
Extra Small VMs
A hidden gem among the instance sizes, with a 4x cost advantage
If you can live within the bandwidth and memory constraints, it's a big bang for the buck
Websites versus WebRoles
Git deployment on Windows Azure Websites is very nice
The ability to use a RoleEntryPoint on WebRoles can be more important than ease of deployment:
Ability to control dependencies
Scale beyond Websites and VMs
SSL support is free
WebRoles FTW
Single vs Many Deployments
Many deployments are required for geo-redundancy, but coordinating upgrades becomes challenging.
The ability to dynamically route messages works to your advantage:
Just ‘turn off’ a geo-route until the upgrade completes
If a datacenter is having ‘issues’, you can remove it from routing
Deployment PatternsPaaS vs IaaS
Extra Small VMs
Azure Websites vs WebRoles
Single Deployment vs many
Handling outages and disasters
Outages are extremely common (trust me), as are service degradations.
Every service you use has its own SLA.
The best you can do is the product of each SLA, e.g. 99.95% x 99.9% x 99.9% = 99.75%.
Your service will go down.
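The SLA arithmetic above generalizes: a request that depends on several services in series can do no better than the product of their availabilities, which is always worse than the weakest individual SLA. A quick check:

```python
# Composite availability of serially-dependent services: multiply the
# individual SLAs (as fractions), since all must be up for a request to succeed.
def composite_sla(*slas_percent):
    result = 1.0
    for s in slas_percent:
        result *= s / 100.0
    return result * 100.0

# The example from the slide: 99.95% compute x 99.9% storage x 99.9% SQL.
availability = composite_sla(99.95, 99.9, 99.9)   # roughly 99.75%
```

At 99.75% availability that is over 21 hours of allowable downtime per year, which is why remediation planning matters even when every dependency individually looks solid.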
Remediation Strategies
Weigh your risk: tradeoffs abound; what is your single point of failure?
Queues can help; distributed queues might be necessary
Geo-distribution: fault domains, multi-datacenter, multi-cloud
RecapBuilding for Scale
Automation and Testing
Deployment Patterns
Disaster and Recovery
© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.