Mark Russinovich, Technical Fellow, Microsoft Azure
Avoiding #CloudFail: Learning Lessons from Microsoft Azure
- Automate
- Scale out
- Test in production
- Deploy early, deploy often
- Always blame the compiler
Stuff Everyone Knows About Cloud Development
But there are many more rules…
But First: Basic Azure Resource Management Architecture
[Architecture diagram: the Portal and PowerShell call RDFE through its REST API; RDFE fronts the Resource Providers (Storage, Azure DB, Service Bus) and multiple Fabric clusters that host compute VMs]
- Customers reported that the portal did not show their mobile or media resources
- Root cause: a change to the resource provider string comparison made it case-sensitive, so "mobileservice" no longer matched "MobileService"
Dude, Where are My Mobile Services?
- The AM2PrdApp03 Fabric cluster did not show up in the RDFE inventory
- Root cause: a user entered a mixed-case label for the Fabric cluster; RDFE uses a case-sensitive compare with the region map
Dude, Where’s My Cluster?
Case has to be handled on a case-by-case basis
Be Sensitive About Case
Case insensitive:
- DNS names
- GUIDs
- URL scheme (e.g. "https" vs. "HTTPS")
- Certificate thumbprints
- Windows filenames

Case sensitive:
- Username
- Password
- XML elements
- JSON
- Email address
- HTTP headers and verbs
- User Agent
- Base64 encoded strings
- Linux filenames
- Azure Storage objects
- Comply with industry standards/conventions
- Be compatible with external systems
- All user-friendly text should be case-sensitive, such as Display Name, Description, Label/tags, ...
- All other data should be case-insensitive
- Preserve case of string parameters (see the sketch below)
- Document casing at string entry points
Casing Rules
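A minimal sketch of these rules in C#; the store and method names are illustrative, not actual Azure code:

// Compare identifiers case-insensitively, but never mutate or discard the caller's casing.
static bool SameProvider(string a, string b) =>
    string.Equals(a, b, StringComparison.OrdinalIgnoreCase); // "mobileservice" now matches "MobileService"

static void SaveLabel(KeyValueStore store, string label) =>
    store.Write(label.ToLowerInvariant(), label); // case-insensitive key, case-preserving value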
A little more detail can go a long way…
- Error log not reporting a name made correlation difficult:
  System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> Microsoft.ServiceModel.Web.WebProtocolException: Server Error: The service name is unknown (NotFound)
- Error message in test environment indicating a beta feature was missing was ambiguous
- Intermittent failures because of header incompatibility in test environment made troubleshooting painful:
  HTTP Status Code: 400. Service Management Error Code: MissingOrIncorrectVersionHeader. Message: The versioning header is not specified or was specified incorrectly.
Log As If That's All You Have
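A hedged illustration of the difference; the logger call and field names are made up for the example:

// Unhelpful: "The service name is unknown (NotFound)" — which service? which request?
// Better: include every identifier you'd want during a 3am investigation.
Logger.Instance.Error(
    "Lookup failed for service '{0}' (subscription {1}, api-version '{2}'): {3}",
    serviceName, subscriptionId, requestVersion, exception);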
- This code broke log filtering in production, causing a flood:

for (size_t i = 0; i < prefixes.size(); i++)
{
    if (StrUtils::StartsWith(id, prefixes[i]), true)  // bug: the stray ", true" is a comma operator, so the condition is always true
    {
        return true;
    }
}

- Rules: build with /warnaserror; don't suppress compiler warnings. EVER.
Yes, Code Hygiene Matters, Even in the Cloud
- It's difficult to have universal coverage for exception handling: many layers of code written by many developers, and third-party code may throw exceptions
- Should you have a catch-all or fail fast?
- Visual Studio says not to catch all
- Catching an unrecoverable error can leave the component unstable
- But failing fast on exceptions handling user-controlled data can expose the service to denial of service
Exceptional Coding
- Call a final handler after all known exceptions are handled
- Crash if it's not user-initiated and unrecoverable
- Otherwise log an error and return an error
- Log once per hour per exception and report status 500
Not All Exceptions Are Created Equal

public static bool ShouldCrashOnUnhandledException(Exception e)
{
    if (ExceptionProcessor.IsCrashingException(e))
    {
        Logger.Instance.Alert(
            "[UnhandledException] Process crashed due to fatal exception: {0}",
            e.ToString());
        return true;
    }
    else
    {
        AlertsManager.AlertOrLogException(e);
        return false;
    }
}
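A sketch of how such a final handler might be wired into a top-level catch-all; the request-processing names are illustrative:

try
{
    ProcessRequest(request);
}
catch (Exception e)
{
    if (ShouldCrashOnUnhandledException(e))
    {
        Environment.FailFast("Fatal unhandled exception", e); // unrecoverable and not user-initiated: crash
    }
    response.StatusCode = 500; // otherwise the exception was logged (rate-limited) above; return an error
}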
- Exceptions are expensive. Rule: do not throw an exception in services for expected errors (example: an exception when a resource is not found)
- Walking the stack trace can be CPU-intensive. Rule: control logging the stack to just once per request or on demand (for debugging unknown errors)
- Re-throwing from an async EndXX method does not capture the original stack trace, and copying the remote stack trace is a very expensive operation. Rule: recreate the same type of exception with the original exception as the innerException whenever possible (see the sketch below)
Exception Handling Rules
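A minimal sketch of the innerException rule for an async End method; WebRequest is just a familiar stand-in for the actual service code:

static WebResponse EndGetResponseSafe(WebRequest request, IAsyncResult asyncResult)
{
    try
    {
        return request.EndGetResponse(asyncResult);
    }
    catch (WebException e)
    {
        // Recreate the same exception type with the original as innerException:
        // cheaper than copying the remote stack trace, and no context is lost.
        throw new WebException(e.Message, e);
    }
}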
- RDFE serves as the front-end for Resource Providers (RPs); RPs include Websites, Traffic Manager, Azure DB
- Testing of RDFE requires RP interaction
- There are many instances of test RDFE, and it's not practical to deploy RPs one-to-one with test RDFE
I’m Hooked on You
[Diagram: the impractical one-to-one approach — each test RDFE instance (RDFE_1, RDFE_2, … RDFE_X) gets its own SQL Azure instance and runs its own RDFE test cases]
- Workaround used in test automation: RDFE test deployments sharing storage with a well-known test RDFE instance
- Problem: tests interfere with and corrupt each other
[Diagram: test RDFE instances RDFE_1 … RDFE_X and their test cases all share the Primary RDFE's SQL Azure and shared storage at management.rdfetest.dnsdemo4.com]
Let's Pretend You Only Love Me
- Rule: make components capable of talking to multiple versions concurrently
- Updated RPs to be able to communicate with multiple RDFEs
- Use registration with a "return address" pattern (see the sketch after the diagram below)
- Adheres to a layering extension model where dependencies are one-way
Monogamous Services Are So 2000s
[Diagram: each test RDFE instance (RDFE_1, RDFE_2, … RDFE_X) registers its own return address (137.116.176.40, 137.116.176.41, 137.116.176.42) with a shared Test SQL Azure, which sends each response "To:" the registered return address]
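A minimal sketch of the return-address pattern, with hypothetical types standing in for the real RDFE/RP protocol:

// The RP keeps a registration table instead of a single hard-wired RDFE endpoint.
readonly Dictionary<string, Uri> registrations = new Dictionary<string, Uri>();

public void Register(string rdfeId, Uri returnAddress) =>
    registrations[rdfeId] = returnAddress; // e.g. the caller's own IP and port

public async Task NotifyAsync(string rdfeId, HttpContent result)
{
    // Reply to whichever RDFE issued the request, so one RP serves many RDFEs concurrently.
    await httpClient.PostAsync(registrations[rdfeId], result);
}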
- RDBug 1042990 "Testability: Expose DeleteExtensions API": stale test VM extensions were time-consuming to clean up and caused bloat of stores
- RDBug 764505 "[AM/CSM Design] Need UnEntitleResourceForSubscription API": the only way to test subscription RP entitlement provisioning was to delete and recreate subscriptions
- Rule: implement full CRUD for all resources (see the sketch below)
Clean Up Your Mess When You’re Done
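A minimal sketch of the "full CRUD" rule; the interface is illustrative, not an Azure API:

public interface IResourceOperations<T>
{
    T Create(string id, T resource);
    T Read(string id);
    T Update(string id, T resource);
    void Delete(string id);  // the operation that was missing above: without it,
                             // tests had to delete and recreate whole subscriptions
}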
- A developer wrote code in RDFE that failed when it encountered unknown capabilities from a new Fabric version
- Caused issues when the upgrade order changed
- Rule: be deliberate about forward compatibility (see the sketch below)
- If a parameter can be ignored, ignore it
- If it has semantics that affect state, fail with a version error
- Don't expose semantic features to down-level components
Back to the Future
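A sketch of the forward-compatibility rule, assuming a hypothetical MustUnderstand flag (in the spirit of SOAP's mustUnderstand attribute) that marks parameters whose semantics affect state:

foreach (var capability in newerVersionRequest.Capabilities)
{
    if (knownCapabilities.Contains(capability.Name))
    {
        Apply(capability);
    }
    else if (capability.MustUnderstand)
    {
        // Unknown and state-affecting: fail with a version error, don't guess
        throw new InvalidOperationException(
            $"Capability '{capability.Name}' requires a newer component version.");
    }
    // Unknown but ignorable: skip it, so upgrade order no longer matters
}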
- Customers began to complain that newly created IaaS VMs failed to provision
- Logs showed that only VMs using one of two images failed
- Those images had just been uploaded
- Test provisioning consistently failed
Dude, Where’s My VM?
- A check of the images revealed that they were corrupt
- Flow of image updates: every month new OS VHD images are produced; after boot performance optimization, prefetch data is added to the VHD; the image is uploaded to the platform image repository; a manual test is performed by creating a new VM, RDP'ing in, and running tests in the VM
- Root cause: images corrupted during upload; not detected because a human tested the Stage environment, not Production
- Rule: assume data will be corrupted in transit; use CRC64 checksums (see the sketch below)
A Case of Mistaken Identity
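A sketch of end-to-end upload verification using a table-driven CRC64 (the ECMA-182 polynomial, reflected); this illustrates the rule and is not the checksum code Azure Storage actually uses:

static readonly ulong[] Crc64Table = BuildCrc64Table();

static ulong[] BuildCrc64Table()
{
    var table = new ulong[256];
    for (uint i = 0; i < 256; i++)
    {
        ulong crc = i;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xC96C5795D7870F42UL : crc >> 1;
        table[i] = crc;
    }
    return table;
}

static ulong Crc64(Stream data)
{
    ulong crc = ulong.MaxValue;
    int b;
    while ((b = data.ReadByte()) != -1)
        crc = Crc64Table[(byte)(crc ^ (ulong)b)] ^ (crc >> 8);
    return ~crc;
}

// Compute the checksum before upload, send it alongside the image, and recompute
// after upload; provision from the image only if the two values match.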
- Users started complaining that after a VIP swap they could not perform operations on their cloud services
- Was not detected by monitoring systems
- Affected only a small number of customers
Really? Isn’t that a Bit Much?
- You can deploy two versions of a cloud service:
- Production: has the DNS name and IP address of the cloud service you publish
- Stage: has a temporary DNS name and IP address
- To promote the Stage version to Production, you "VIP swap"
What’s a VIP Swap?
[Diagram: two deployments, Role A/Role B and Role A'/Role B', each exposing ports 80, 3389, and 3390; the Production VIP (VIP1) carries <dnsname>.cloudapp.net, the Staging VIP (VIP2) carries <guid>.cloudapp.net]
- RDFE uses storage table rows to cache the state of cloud service deployments, including the state of role instances and deployment slots
- A row is updated by mutating operations like VIP swap; it's also updated by the RDFE cache updating the status of roles
- Multiple roles are updated via table conditional update (optimistic concurrency)
VIP Swap Internals
Before swap:
Slot         VIP             Role A    Role B
Production   168.133.1.22    Healthy   Healthy
Stage        168.124.33.22   Healthy   Healthy

After swap:
Slot         VIP             Role A    Role B
Stage        168.124.33.22   Healthy   Healthy
Production   168.133.1.22    Healthy   Healthy
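A sketch of the conditional-update (optimistic concurrency) pattern the cache relies on; the types are illustrative rather than the actual storage table SDK:

DeploymentRow row = table.Read(serviceKey);    // returns the row plus its current ETag
row.SwapProductionAndStage();                  // mutate the local copy
if (!table.ReplaceIfMatch(row, row.ETag))      // write succeeds only if the ETag still matches
{
    // Another writer (e.g. the role-status updater) changed the row first.
    // Re-read and retry; blindly overwriting is what creates inconsistent state.
}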
- A bug in the RDFE update caused a race condition: a change would be overwritten, causing inconsistent state
- RDFE does not allow update operations when it detects inconsistency
- The race condition meant the error rate was only marginally higher than normal, so it went undetected
The VIP Swap Bug
Before swap:
Slot         VIP             Role A    Role B
Production   168.133.1.22    Healthy   Healthy
Stage        168.124.33.22   Healthy   Healthy

Intended after swap:
Slot         VIP             Role A    Role B
Stage        168.124.33.22   Healthy   Healthy
Production   168.133.1.22    Healthy   Healthy

Actual result (race overwrote the update — two Stage rows, no Production row):
Slot         VIP             Role A    Role B
Stage        168.124.33.22   Healthy   Healthy
Stage        168.124.33.22   Healthy   Healthy
- Root cause: developer claimed "unintuitive behavior of ADO.NET"
- Rule: direct a slice of traffic to an updated version for several days
- Increase traffic gradually
- Set alerts based on the difference in failure rates of the two versions (see the sketch below)
VIP Swap Learnings
[Animation: customer traffic directed first 5%, then 30%, then 50% to RDFE vNext alongside the existing RDFE A and RDFE B instances, until vNext serves all traffic]
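A sketch of the failure-rate-difference alert; the thresholds and counter names are illustrative:

double currentRate = (double)currentFailures / currentRequests;
double vNextRate = (double)vNextFailures / vNextRequests;

// Compare the versions against each other, not against an absolute threshold:
// the VIP swap bug only nudged the overall error rate, so absolute alerts stayed quiet.
if (vNextRequests > 1000 && vNextRate > currentRate * 1.5)
    Alert("RDFE vNext failing at {0:P2} vs {1:P2} on current version", vNextRate, currentRate);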
- An operator was performing regular clean-up of internal (Microsoft) Azure subscriptions
- Wasn't aware of the new process to go through front-line support
- Resulted in a build-up of delete requests; the normal process was bypassed and the operator was allowed to submit a batch delete
- Several subscriptions had originally been created internally on behalf of partners, and got deleted; front-line checks would have stopped the delete
- Customers graciously let us know what had happened
Oops…
- All data was fortunately recovered because of the data retention promised in http://www.windowsazure.com/en-us/support/legal/subscription-agreement
- The subscriptions had to be manually recreated, however
- Rule: use "soft delete" (see the sketch below): resources become inaccessible to the customer; non-data resources (e.g. VMs) are deleted; a timer is set to delete the rest after 90 days
Let’s Not Do That Again
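A minimal sketch of the soft-delete rule; the subscription model is illustrative:

public void SoftDelete(Subscription sub)
{
    sub.State = SubscriptionState.Disabled;        // customer can no longer reach the resources
    ReleaseNonDataResources(sub);                  // e.g. VMs are deallocated immediately
    sub.PurgeAt = DateTime.UtcNow.AddDays(90);     // data is only really deleted when the timer fires
}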
Storage Certificate Expiration: "Sorry I'm late, the alarm clock never rang"
http://blogs.msdn.com/b/windowsazure/archive/2013/03/01/details-of-the-february-22nd-2013-windows-azure-storage-disruption.aspx
SSL connections to Azure storage began failing at 12:29pm on February 22, 2013
Customers immediately noticed
We did, too
It’s Not You, It’s Me
- Certificates are managed by the "Secret Store"
- Once a week an automated system scans the store; an alert is fired for certs within 180 days of expiration; the team obtains a new cert and updates the Secret Store
- That process was followed. The breakdown:
- On January 7, the storage team updated the three certs in question
- Failed to flag that a storage deployment had a date deadline
- Deployment was delayed behind another higher-priority update
We Updated It, We Promise!
- The real breakdown was not monitoring production
- We now scan all service endpoints, internal and external, on a weekly basis (see the sketch below); at 90 days until expiration, a cert shows up on VP reports
- Rule: service development requires thinking through the entire life-cycle of the software
- We are working on "managed service identities" to fully automate non-PKI certs
Be Certain About Your Certs
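A sketch of scanning a live endpoint for certificate expiration, roughly what such a production monitor needs to do; this is illustrative, not Microsoft's actual scanner:

using System;
using System.Net.Security;
using System.Net.Sockets;
using System.Security.Cryptography.X509Certificates;

static double DaysUntilExpiry(string host, int port = 443)
{
    using var tcp = new TcpClient(host, port);
    using var ssl = new SslStream(tcp.GetStream(), leaveInnerStreamOpen: false,
        (sender, cert, chain, errors) => true);          // inspect the cert even if validation fails
    ssl.AuthenticateAsClient(host);
    var cert2 = new X509Certificate2(ssl.RemoteCertificate);
    return (cert2.NotAfter - DateTime.UtcNow).TotalDays; // alert (and escalate) when this drops below 90
}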
Minor #Fail:
- Be Sensitive About Case
- Log As If That's All You Have
- Yes, Code Hygiene Matters, Even in the Cloud
- Exceptional Coding
- I'm Hooked on You
- Clean Up Your Mess When You're Done
- Back to the Future
- A Case of Mistaken Identity

Major #Fail:
- VIP Swap
- Subscription Deletion
- Storage Certificate Expiration
Summary
- Cloud development adds new rules and makes some of the old ones matter more
- Many rules are devops-oriented
- Operating at large scale with loosely coupled services results in others
If you have hard-won rules to share, please email me: [email protected]
We Made These Mistakes So You Don’t Have To
Good luck, and may the force of the cloud be with you!
Your Feedback is Important
Fill out an evaluation of this session and help shape future events.
Scan the QR code to evaluate this session on your mobile device.
You’ll also be entered into a daily prize drawing!
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.