Building Big: Lessons Learned from Windows Azure Customers



Page 1: Building Big: Lessons Learned from Windows Azure Customers
Page 2: Building Big: Lessons Learned from Windows Azure Customers

Building Big: Lessons Learned from Windows Azure Customers
Mark Simms (@mabsimms), Christian Martinez
Windows Azure Customer Advisory Team
4-554

Page 3: Building Big: Lessons Learned from Windows Azure Customers

Designing resilient large-scale services requires careful design and architecture choices.
In this session we will explore key scenarios extracted from customer engagements, and what happens at big scale.

Windows Azure Customer Advisory Team (CAT)
• Works with internal and external customers to build out some of the largest applications on Azure
• We get our hands dirty on all aspects of delivery: design, implementation and, all too often, firefighting

This is meant to be an interactive discussion – if you don’t ask questions, we will!

This session will be customer stories, patterns & code.

We will get deeply nerdy with .NET and Azure services.

Setting the stage

Page 4: Building Big: Lessons Learned from Windows Azure Customers

Story time with Christian

Page 5: Building Big: Lessons Learned from Windows Azure Customers

A large web site, processing asynchronous work


Azure Cloud Service

Web Role

Page 6: Building Big: Lessons Learned from Windows Azure Customers

100k+ connected devices publishing activity reports

Target end to end latency (including cellular link) – 8 seconds

Target throughput 5000 messages / second

Connected device(s) service, asynchronous processing

Azure Cloud Service

Web Role Worker

Service Bus

Page 7: Building Big: Lessons Learned from Windows Azure Customers

Batch receiving messages for throughput

Flag completion for individual messages

Connected device(s) service, asynchronous processing
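The receive pattern on this slide (pull messages in batches, then flag completion per message) can be sketched without any broker at all. The sketch below is Python rather than the .NET code the session used, and `InMemoryQueue`, `receive_batch`, `complete`, `abandon` and `drain_once` are hypothetical names invented for illustration; they are not the Service Bus API.

```python
from collections import deque


class InMemoryQueue:
    """Hypothetical stand-in for a broker queue with peek-lock semantics."""

    def __init__(self, messages):
        self._pending = deque(messages)
        self._locked = {}
        self._next_token = 0

    def receive_batch(self, max_count):
        """Lock and return up to max_count messages in one round trip."""
        batch = []
        while self._pending and len(batch) < max_count:
            token = self._next_token
            self._next_token += 1
            body = self._pending.popleft()
            self._locked[token] = body
            batch.append((token, body))
        return batch

    def complete(self, token):
        """Remove one successfully processed message for good."""
        del self._locked[token]

    def abandon(self, token):
        """Release the lock so the message can be received again."""
        self._pending.append(self._locked.pop(token))


def drain_once(queue, handler, batch_size=32):
    """Receive one batch, then complete or abandon each message on its own."""
    completed = 0
    for token, body in queue.receive_batch(batch_size):
        try:
            handler(body)
        except Exception:
            queue.abandon(token)   # only this message is retried
        else:
            queue.complete(token)
            completed += 1
    return completed
```

Completing per message means one failing message does not force the rest of its batch to be redelivered.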

Page 8: Building Big: Lessons Learned from Windows Azure Customers

Serialized processing – increasing latency

Batching receive for chunky communication – needed to meet throughput goals

Processing messages in sequence drives up latency

Diagram: a Service Bus Queue delivers a Message Batch to Process Messages, which runs Process Message, Process Message, ... one after another.

Page 9: Building Big: Lessons Learned from Windows Azure Customers

Switch to parallel processing

Diagram: a Service Bus Queue delivers a Message Batch to Process Messages, which runs Process Message, Process Message, ... side by side.
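A minimal sketch of processing a received batch in parallel while still capping how many messages run at once. It is written in Python with `asyncio` as a stand-in for the .NET task code discussed in the session; `process_batch`, `handler` and `max_concurrency` are hypothetical names, and the cap of 8 is an arbitrary assumption.

```python
import asyncio


async def process_batch(messages, handler, max_concurrency=8):
    """Run handler over a batch concurrently, never exceeding max_concurrency."""
    gate = asyncio.Semaphore(max_concurrency)

    async def run_one(message):
        async with gate:              # wait here when the cap is reached
            return await handler(message)

    # return_exceptions=True reports a failed message in its own slot
    # instead of raising out of the whole batch.
    return await asyncio.gather(
        *(run_one(m) for m in messages), return_exceptions=True
    )
```

Results come back in the same order as the input batch, so each one can be matched to its message when flagging completion.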

Page 10: Building Big: Lessons Learned from Windows Azure Customers

Initial performance very smooth

App quickly spikes to 100% CPU on all cores

Execution time spikes to minutes!

Something isn’t right

Page 11: Building Big: Lessons Learned from Windows Azure Customers

Most threads blocked in FindEntry of Dictionary

Using a Dictionary to look up the message handlers

What does WinDbg say?

Page 12: Building Big: Lessons Learned from Windows Azure Customers

Large variations in avg/max latency

After time, processing rate drops to ~5 msg / second

CPU at ~ 0%

Something still isn’t right

Chart: Variation in Message Processing (Avg, Min, Max) for Message Type 1 through Message Type 8; time axis from 00:00.0 to 00:30.2.

Page 13: Building Big: Lessons Learned from Windows Azure Customers

What does PerfView have to say?

http://channel9.msdn.com/Series/PerfView-Tutorial/Tutorial-12-Wall-Clock-Time-Investigation-Basics

System.Core!System.Dynamic.Utils.TypeExtensions.GetParametersCached

Page 14: Building Big: Lessons Learned from Windows Azure Customers

Looks simple enough…
• Required messaging exchange patterns for queuing (pub/sub, competing consumer)
• Partitioning and load balancing (affinity) for queue resources
• Latency vs. throughput – batching
• Resources vs. latency – bounding concurrency of task execution
• Message dispatch – dynamic vs. fixed function tables
• Poison messages, retries
• Idempotent processing

Asynchronous & queue based processing

Cloud Service Boundary

Load Balancer

Web Servers

Database

App Servers

Azure Queue(s)
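One reading of "fixed function tables" for message dispatch is a lookup table that is built once at startup and only read afterwards, so concurrent workers never mutate it and no handler is resolved dynamically per message. The Python sketch below illustrates that idea only; the handler names and message types are hypothetical, and it is not the code from the session.

```python
from types import MappingProxyType


def handle_status(payload):
    return f"status:{payload}"


def handle_alert(payload):
    return f"alert:{payload}"


# Built once at startup, then exposed through a read-only view so that
# worker threads can share it without locking.
HANDLERS = MappingProxyType({
    "status": handle_status,
    "alert": handle_alert,
})


def dispatch(message_type, payload):
    """Look up the handler for a message type and invoke it."""
    handler = HANDLERS.get(message_type)
    if handler is None:
        raise KeyError(f"no handler registered for {message_type!r}")
    return handler(payload)
```

An unknown message type fails fast here; in a queue-based system that message would typically be routed to the poison-message path instead.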

Page 15: Building Big: Lessons Learned from Windows Azure Customers

(Very) Large scale website, backed by 500 Azure SQL databases

Physically collapsed web/app tiers to reduce latency

What can happen during periods of extreme success?

Large website, scale-out relational data storage


Azure Cloud Service

Web Role

500 databases

Page 16: Building Big: Lessons Learned from Windows Azure Customers

Each cloud service has a single public IP (VIP)

Each Azure SQL Database cluster also has a single public IP

120 web role instances, 500 databases

Connection pool default size = 100

What’s the limit?

Large website, scale-out relational data storage

Azure Load Balancer

DB1 DB2 DB3

SrcIp | SrcPort | DestIp | DestPort
A.B.C.D | 1 | E.F.G.H | 1433
A.B.C.D | 2 | E.F.G.H | 1433
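A back-of-the-envelope check on "What's the limit?", using only the numbers on this slide. The ceiling of 65,535 is an assumption: it is the size of the 16-bit TCP source port field, which bounds the distinct connections between one source IP and one destination IP and port. The real platform limit may be lower, and the worst case assumes every database sits behind the same cluster IP.

```python
WEB_ROLE_INSTANCES = 120    # from the slide
DATABASES = 500             # from the slide
POOL_SIZE = 100             # default connection pool size, from the slide
MAX_SOURCE_PORTS = 65_535   # assumption: 16-bit TCP source port field


def potential_connections(instances, databases, pool_size):
    """Worst case: every instance fills one pool per database."""
    return instances * databases * pool_size


demand = potential_connections(WEB_ROLE_INSTANCES, DATABASES, POOL_SIZE)
oversubscription = demand / MAX_SOURCE_PORTS

print(f"potential connections: {demand:,}")
print(f"times over the port ceiling: {oversubscription:.1f}")
```

The potential demand is 6,000,000 connections against a ceiling in the tens of thousands, so pool sizing has to be planned rather than left at the default.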

Page 17: Building Big: Lessons Learned from Windows Azure Customers

(Very) Large scale website, leveraging an external service for content moderation

Protected the external service dependency with a retry policy

On average called in 0.5% of service calls

Large website, leveraging external services


Azure Cloud Service

Web Role

500 databases

Content moderation

service

Page 18: Building Big: Lessons Learned from Windows Azure Customers

Too much trust in downstream services and client proxies

Not bounding non-deterministic calls

Blocking synchronous operations

No load shedding

Unintended consequences

Chart: Web Request Response Latency (Avg Latency, in seconds); x-axis 1 to 29, y-axis 0 to 450.
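Two of the problems listed on this slide, unbounded calls and missing load shedding, can be addressed with a small wrapper around the external dependency. The sketch below is Python with `asyncio`, not the .NET code from the session; `BoundedCaller`, `Overloaded` and the limit values are hypothetical.

```python
import asyncio


class Overloaded(Exception):
    """Raised when a call is rejected instead of being queued."""


class BoundedCaller:
    """Wrap calls to a dependency with a timeout and an in-flight cap."""

    def __init__(self, max_in_flight, timeout_seconds):
        self._max_in_flight = max_in_flight
        self._timeout = timeout_seconds
        self._in_flight = 0

    async def call(self, operation):
        # Load shedding: reject immediately rather than pile up waiters.
        if self._in_flight >= self._max_in_flight:
            raise Overloaded("too many calls in flight")
        self._in_flight += 1
        try:
            # Bounding: a slow dependency costs at most timeout_seconds.
            return await asyncio.wait_for(operation(), self._timeout)
        finally:
            self._in_flight -= 1
```

The caller decides what a rejected or timed-out call means for the request, for example skipping moderation and flagging the content for later review.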

Page 19: Building Big: Lessons Learned from Windows Azure Customers

Rich clients (mobile and desktop) publishing documents for processing

Using Shared Access Signature (SAS) tokens for direct writes to storage

Looks like a good design…

Large website, asynchronous document processing


Azure Cloud Service

Web Role Worker

Azure Storage Account

Blob

Queue

Page 20: Building Big: Lessons Learned from Windows Azure Customers

Storage account URI is “hard coded” into the client application

Need to update all 100k+ client applications to change storage account

Large website, asynchronous document processing
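One way to avoid baking a storage account into 100k+ clients is to have each client ask the service where to upload, so the account list lives in server-side configuration. The Python sketch below shows only that indirection. `STORAGE_ACCOUNTS`, `issue_upload_target` and the example URIs are hypothetical, and issuing the SAS token itself is left out.

```python
import hashlib

# Hypothetical server-side configuration; clients never see this list.
STORAGE_ACCOUNTS = [
    "https://account-a.example.invalid",
    "https://account-b.example.invalid",
]


def issue_upload_target(document_id, accounts=STORAGE_ACCOUNTS):
    """Pick a storage account for this document and return its upload URI."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(accounts)
    # A real service would attach a short-lived SAS token to this URI.
    return f"{accounts[index]}/uploads/{document_id}"
```

Because the client receives the full URI at run time, accounts can be added or replaced by changing configuration on the service side only.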

Page 21: Building Big: Lessons Learned from Windows Azure Customers

Design Choices & Challenges

Page 22: Building Big: Lessons Learned from Windows Azure Customers

Devices and Services workload – connected embedded devices and applications streaming data to the cloud
• 100k+ devices, growing 50k / month
• Regional affinity (North America only)

Optimize for the most stringent case

Simplicity is king

No one, true solution

Exploration – Data Design

Query | Throughput | Latency | Reach
Every 30 seconds, each device publishes a status update (location, health, etc) | 4k – 100k msgs/sec | 2000 – 5000 ms | Single device
Every 10 minutes, a batch job retrieves all of the status updates delivered in the past 10 minutes | 2M msgs / 10 minutes | 2 minutes | All devices
On an ad-hoc basis, a user may request the current status and recent history of all of their devices | 15 requests / second | 500 ms | Limited device set
On an ad-hoc basis, a user may request a historical time range of all of their devices | 5 requests / second | 750 ms | Limited device set
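The throughput figures for this workload follow from the device count and the publish interval. A short check, assuming exactly 100,000 devices (the slide says "100k+") publishing evenly across each 30-second interval:

```python
DEVICES = 100_000            # assumption: the "100k+" floor from the slide
PUBLISH_INTERVAL_S = 30      # each device publishes every 30 seconds
BATCH_WINDOW_S = 10 * 60     # the batch job covers the past 10 minutes

# Steady-state insert rate if publishes are spread evenly over the interval.
steady_rate = DEVICES / PUBLISH_INTERVAL_S

# Status updates one batch job has to retrieve.
per_window = DEVICES * (BATCH_WINDOW_S // PUBLISH_INTERVAL_S)

print(f"steady rate: {steady_rate:,.0f} msgs/sec")
print(f"per batch window: {per_window:,} msgs")
```

The per-window result matches the 2M msgs / 10 minutes on the slide. The steady rate of roughly 3,300 msgs/sec sits near the low end of the stated range; the 100k msgs/sec high end would correspond to every device publishing in the same second, which is an interpretation rather than something the slide states.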

Page 23: Building Big: Lessons Learned from Windows Azure Customers

Cannot fulfill with a single database
• Exceeds transactional throughput limit
• Data growth will exceed practical size limits
Insert heavy workload
• Pressure on transaction log
Partitioning keys?
• Device ID, User account?
Partitioning approach
• Bucket, range, lookup?

Option 1: Relational – Considerations and Challenges

Page 24: Building Big: Lessons Learned from Windows Azure Customers

Periodic query spike on bulk reporting
• Impact to online operations (30M+ rows)
Rebalancing
• Moving data between partitions / databases
Distribution of reference data (relational model)
• Keeping in sync
Impact of noisy neighbors (Azure SQL DB)
• Variable latency, pushback under heavy load
Cost of management (SQL IaaS)
• Cost of automation for patching, maintenance

Option 1: Relational – Considerations and Challenges

Page 25: Building Big: Lessons Learned from Windows Azure Customers

Inserting large volumes of streaming data into a data store
• Data store is governed on number of operations (transactions)
Trade consistency for throughput – enqueue, batch and publish
• Get: increased throughput, shift work to ”cheap” resource (app memory)
• Give up: full durability (potential data loss)

Tackling the Insert Challenge
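A minimal sketch of "enqueue, batch and publish": records accumulate in application memory and reach the store as one operation per batch. It is Python rather than the session's .NET code; `BatchingWriter`, `publish` and the batch size of 100 are hypothetical.

```python
class BatchingWriter:
    """Buffer records in memory and publish them to the store in batches."""

    def __init__(self, publish, max_batch=100):
        self._publish = publish      # callable that writes one batch
        self._max_batch = max_batch
        self._buffer = []

    def enqueue(self, record):
        """Accept a record; publish when the buffer reaches max_batch."""
        self._buffer.append(record)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self):
        """Publish whatever is buffered as a single store operation."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._publish(batch)
```

Anything still in the buffer is lost if the process dies, which is the durability given up on the slide. A production version would also flush on a timer so that a quiet period does not strand records.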

Page 26: Building Big: Lessons Learned from Windows Azure Customers

Challenge: know that your site is having issues before Twitter does
• This is not a randomly chosen anecdote.
Instrument, collect, analyze - react
• Best: buy your way to victory (AppDynamics, New Relic, etc)
• Also need to instrument application effectively for ”contextual” data (aka, logging)

Tackling the Insight Challenge

Page 27: Building Big: Lessons Learned from Windows Azure Customers

Instrument for production logging
• If you didn’t log & capture it, it didn’t happen
Implement inter-service monitoring and alerting
• Nothing interesting happens on a single instance
Run-time configurable logging
• Enable activation (capture or delivery) of additional channels at run-time
Getting logging right
• All logging must be asynchronous
• Buffer and filter before pushing to remote service or store

Instrumenting Applications
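The "asynchronous, buffer and filter" guidance can be illustrated with Python's standard `logging.handlers.QueueHandler` and `QueueListener`. This is an analogy to the pattern, not the .NET EventSource approach demonstrated in the session; `build_async_logger`, the queue capacity and the default level are assumptions.

```python
import logging
import logging.handlers
import queue


def build_async_logger(name, sink, sink_level=logging.WARNING, capacity=10_000):
    """Return a logger whose calls only enqueue; a listener thread feeds sink."""
    log_queue = queue.Queue(maxsize=capacity)     # bounded in-memory buffer
    sink.setLevel(sink_level)                     # filter before delivery
    listener = logging.handlers.QueueListener(
        log_queue, sink, respect_handler_level=True
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False
    listener.start()
    return logger, listener
```

The request path pays only for an in-memory enqueue, while the slow sink (a remote service or store) is written from the listener thread. Calling `sink.setLevel(...)` later changes what gets delivered without a restart, which is one simple form of run-time configurable logging.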

Page 28: Building Big: Lessons Learned from Windows Azure Customers

Bringing down a production system with logging…

Page 29: Building Big: Lessons Learned from Windows Azure Customers

Demo: Instrumenting Applications with Event Source

Page 30: Building Big: Lessons Learned from Windows Azure Customers


Option 2: Compositional Azure Storage

This isn’t a relational workload
• Per-device insert and lookup
• Periodic batch transfer

Per-device lookup
• Natural fit for table storage
• Device ID = Pk
• Data type = Rk

Periodic batch transfer
• Natural fit for blob storage
• Instance + Timestamp = blob id
• Buffer and write into blocks
• Roll over on time interval (10 min)

Diagram: incoming data passes through a time/space buffer into Table Storage (Pk={Device;Day}, Rk={Timestamp}, Payload={fields}) and Blob Storage (Uri={Minute;Instance}, Payload={JSON Data}).

Querying by device
• By time - direct { PkRk } lookup
• By day - direct { Pk }, max of 2880 records per partition

Batch transfer by time frame
• Parallel download of all blobs matching timeframe pattern

Adding scale capacity
• 20k operations per storage account,
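The key layout on this slide can be made concrete with a few helper functions. The Python sketch below assumes string formats that the slide does not specify (the `;` separator and the timestamp layouts); `table_keys` and `blob_uri` are hypothetical names.

```python
from datetime import datetime, timezone

PUBLISH_INTERVAL_S = 30   # one status update per device every 30 seconds


def table_keys(device_id, timestamp):
    """Pk={Device;Day}, Rk={Timestamp}: one partition per device per day."""
    pk = f"{device_id};{timestamp:%Y%m%d}"
    rk = f"{timestamp:%Y%m%dT%H%M%S}"
    return pk, rk


def blob_uri(instance_id, timestamp, window_minutes=10):
    """Uri={Minute;Instance}: round down to the start of the rollover window."""
    minute = timestamp.minute - timestamp.minute % window_minutes
    window = timestamp.replace(minute=minute, second=0, microsecond=0)
    return f"{window:%Y%m%dT%H%M};{instance_id}"


# Seconds in a day divided by the publish interval.
records_per_partition = 24 * 60 * 60 // PUBLISH_INTERVAL_S

example = datetime(2013, 6, 27, 14, 37, 12, tzinfo=timezone.utc)
print(table_keys("dev42", example))
print(blob_uri("web_0", example))
print(records_per_partition)
```

With a 30-second interval, a device-per-day partition holds 2880 records, which is where the per-partition maximum on the slide comes from.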

Page 31: Building Big: Lessons Learned from Windows Azure Customers

Azure Storage Account - Blob

Max blob size (block) | 200 GB (50k blocks)
Max block size | 4 MB
Max blob size (page) | 1 TB
Max page size | 512 bytes
Max bandwidth / blob | 480 Mbps
Latency bounds (per operation) | 100 ms nominal; 1-3 sec during load balancing
Scale-out unit | Blob
Scale-out impedance | Low

• Use the appropriate blob type (prefer block blobs with immutable / append-only data)
• Use the largest practical block size (note: network performance may require smaller blocks for “long-haul”)
• For partial reads use 64 KB block size to maximize throughput

Scale
• Use Async Copy API for copying blobs between accounts, providers, etc

Page 32: Building Big: Lessons Learned from Windows Azure Customers

Azure Storage Account - Table

Max operations / second per partition | 5000
Max row size (names + data) | 1 MB
Max column size (byte[] or string) | 64 KB
Maximum number of rows | N/A (up to storage account size limit)
Scale-out unit | Table partition
Scale-out impedance | Low

• Use appropriate partition keys to co-locate data (for query or batch operations) or break data into more partitions (for throughput)

• Avoid use of table storage for applications requiring non-trivial aggregation or function projection

• Store multiple types in same table for normalized queries (do not denormalize table storage schema!)

• Avoid large scans (can be very expensive!); explore use of separate (partially consistent) index table

Scale
• Leverage multiple storage accounts (not multiple tables) to increase operations/second

Page 33: Building Big: Lessons Learned from Windows Azure Customers

Azure Storage Account - Queues

Max messages in a queue | N/A (up to storage account size limit)
Max lifetime of a message | 1 week (auto purged)
Max message size | 64 KB
Max throughput | 2000 messages / second
Scale-out unit | Queue
Scale-out impedance | Medium

• Optimize storage format to reduce message size / avoid 64 KB limit (for larger messages leverage Service Bus or Queues + Blob)

• Retrieve messages in batches to increase throughput

• Use dequeue count on message for poison messages

Scale
• Leverage multiple queues to increase messages / second
• Vertical partitioning: split queues by function
• Horizontal partitioning: split messages between queues (round robin/direct assignment)
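Two of the queue practices above, horizontal partitioning by round robin and using the dequeue count to catch poison messages, can be sketched with plain lists standing in for queues. This is illustrative Python, not the Azure Storage client API; `RoundRobinSender`, `route_received` and the threshold of 5 are assumptions.

```python
import itertools

MAX_DEQUEUE_COUNT = 5   # assumed threshold; tune per workload


class RoundRobinSender:
    """Horizontal partitioning: spread messages evenly across several queues."""

    def __init__(self, queues):
        self._cycle = itertools.cycle(queues)

    def send(self, message):
        next(self._cycle).append(message)


def route_received(message, dequeue_count, work, poison):
    """Divert a message that keeps reappearing instead of retrying it forever."""
    if dequeue_count > MAX_DEQUEUE_COUNT:
        poison.append(message)    # park it for offline inspection
        return "poison"
    work.append(message)
    return "work"
```

Round robin keeps each queue under its own throughput ceiling, while direct assignment (for example by device ID) would be the choice when related messages must land in the same queue.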

Page 34: Building Big: Lessons Learned from Windows Azure Customers

Services site for mobile device applications
• 1M+ users at launch, 1M+ users added per month
• Front ended by Android, iOS, Windows Phone
Personalized information feeds and data sets
• Examples: browsing history, shopping cart
Assuming up to 30% of user base can be online at any point in time
Maximum response latency 250 ms @ 99th percentile

User centric web application

Page 35: Building Big: Lessons Learned from Windows Azure Customers

Where are the scalability bottlenecks?

Where are the availability and failure points? Where are the key insight and instrumentation points?

Tearing apart the architecture

Diagram: a Cloud Service containing a Front End Web Role (four instances), a Caching Role (two instances) and a Worker Role (one instance); Databases (four DBs); Storage (two storage accounts).

Page 36: Building Big: Lessons Learned from Windows Azure Customers

Demo: Implementing an information publishing site

Page 37: Building Big: Lessons Learned from Windows Azure Customers

Recap

Know the numbers – platform scalability targets
• Compute, storage, networking and platform services
• Scalability == capacity * efficiency

Watch out for shared resources and contention points
• At high load and concurrency “interesting” things happen
• Default to asynchronous, bound all calls

Insight is power – measuring and observation of behavior
• Without rich telemetry and instrumentation – down to the call level – apps are running blind
• Buy your way to victory, leverage asynchronous and structured logging

Page 38: Building Big: Lessons Learned from Windows Azure Customers

Resources
• Failsafe: Building scalable, resilient cloud services: http://channel9.msdn.com/Series/FailSafe
• Cloud Service Fundamentals - Reference code for Azure: http://code.msdn.microsoft.com/Cloud-Service-Fundamentals-4ca72649

Page 39: Building Big: Lessons Learned from Windows Azure Customers

© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.