AzureCAT: Design cloud-based solutions for operations, monitoring and troubleshooting
Silvano Coriani, Program Manager, AzureCAT, Microsoft
CDP-B309
Distributed systems monitoring and diagnostics
Understand application health and behaviour
Instrumentation sources
Application telemetry
Diagnostic data collection and analysis
Practical Customer Examples
Agenda
Traditional Troubleshooting
1. Customer complains the app is broken
2. Sys admins / DBAs log onto the machine
3. Manually look at perf counters, DMVs, etc.
4. Find the problem and manually fix it
This model assumes:
• You can wait for the expert to look at the issue
• You don’t have too many issues to examine at once
• The system can be understood by a human (not too complicated)
Most of these assumptions prove false in the cloud
Monitoring and diagnostic experience
Operational scenarios
Scenario | Description/Example | Time Horizon | Data Size
Alerting | Detecting and mitigating problems | Now | Small to Large
Dashboards | Service insight | Now-Recent | Modest
Reports | How is feature X adoption progressing day over day? | Hourly/Daily | Medium
Data Science | Building prediction models based on past behaviors | Unlimited | Very Large
Complex cloud architecture example...
100s of compute nodes, 1000s of Azure SQL DBs
IaaS instances and other services
Real customer examples
Large social experience site Sochi Winter Olympics…
Cloud Customers may just quit your service if it is broken – and you would never know without the right monitoring
Cloud Troubleshooting Model
Cloud apps have key differences from traditional on-premises systems:
• Internet-facing, always up
• Service SLAs – uptime requirements
• Larger scale – ISVs/SaaS vendors host all customers vs. sell/deploy each customer 1-by-1
Troubleshooting in the Cloud
• Too many machines/databases/etc. to troubleshoot manually
• Separate “mitigate” vs. “root cause” (RCA) determination
• Generate telemetry to determine RCA (later)
• Find a way to get things working ASAP (reboot/failover/whatever)
How do I find and fix problems in the cloud service before my customers are affected?
…all the time?
…with potentially millions of customers?
…without spending infinite money on telemetry?
…with a specific focus on:
• Availability
• Performance
• Usage
Rephrased Problem Statement
Understand application health and behavior
Fundamentals of Telemetry & Insight
• Analyze: at a certain size, tools to analyze and monitor the system work
• System for the system: beyond that, you need a system to monitor the system
Insight Perspectives
• users are active
• trends
Characterizing Insight
Instrumentation sources (single and multiple instances)
Characterizing Signals
Sources of Signals
• Manual tracing
Your application is the best source of information if you log all internal/external “transactions” (DBs, web services, etc.) with:
1. Application context (module/component)
2. Host context (server/role/instance/process)
3. Timing information (start/stop/duration)
4. Activity identifier
Correlated insights from SQL Database and other services
Consolidate logs into a central system / dashboard for health monitoring and troubleshooting
Effective instrumentation requires careful planning
Capturing insights at scale
Capturing consistent application level insights
Capture timing and context information through helper delegates (background noise)
Capture contextual errors (inner exceptions, etc.) on error
Logging library is asynchronous (fire-and-forget) to avoid blocking
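The pattern above can be sketched in a few lines. This is an illustrative Python sketch (the deck's actual samples use .NET and SLAB; all names here are hypothetical): a helper wraps each external call, captures application context, host context, timing, and an activity identifier, records inner errors, and hands the record to a background thread so the caller never blocks on logging.

```python
import queue
import socket
import threading
import time
import uuid

log_queue = queue.Queue()

def _writer():
    # Background consumer: drains records and ships them to a sink.
    while True:
        record = log_queue.get()
        print(record)  # stand-in for a real sink (table, blob, ETW, ...)
        log_queue.task_done()

threading.Thread(target=_writer, daemon=True).start()

def tracked(component, operation, func, *args, **kwargs):
    """Run func, capturing context, timing and errors; log fire-and-forget."""
    record = {
        "activity_id": str(uuid.uuid4()),   # correlates related events
        "component": component,             # application context
        "host": socket.gethostname(),       # host context
        "operation": operation,
        "start": time.time(),
    }
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        record["error"] = repr(exc)         # contextual error detail
        raise
    finally:
        record["duration_ms"] = (time.time() - record["start"]) * 1000
        log_queue.put(record)               # enqueue; caller never blocks
```

Note the record is enqueued even on the error path, so failed calls still produce telemetry with their duration and exception attached.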
Logging technologies at play
Event Tracing for Windows
(ETW)
• Native to the Windows platform
• Great performance & OK diagnostic tooling
• Historically hard to publish events
EventSource class
• New in .NET Framework 4.5
• Meant to ease the authoring experience
• Extensible, but supports ETW only out of the box
Semantic Logging
Application Block (SLAB)
• Provides several destinations for events published with EventSource
• Does not require any knowledge of ETW
• Additional tooling support for authoring events
.NET Event Source
Custom Event Source
SLAB in-process: ObservableEventListener → Sinks
SLAB out-of-process: ETW → TraceEvent → Sinks
Third-party tools (e.g. PerfView)
Event Log
Technologies at play
Several sinks/destinations:
• Windows Azure Table
• SQL Database
• Flat file
• Rolling flat file
• Console
Formatters (for text-based sinks):
• JSON
• XML
• Natural (plain text)
SLAB features – sinks
Microsoft Azure Diagnostics (1.0)
• Perf Counters → WAD Performance Counters Table
• Windows Events → WAD Windows Events Logs Table
• Diag Events → WAD Logs Table
Azure Diagnostics 1.2 for PaaS and IaaS
Data Source | Description
IIS Logs | Information about IIS web sites.
Azure Diagnostic infrastructure logs | Information about Diagnostics itself.
IIS Failed Request logs | Information about failed requests to an IIS site or application.
Windows Event logs | Information sent to the Windows event logging system.
Performance counters | Operating system and custom performance counters.
Crash dumps | Information about the state of the process in the event of an application crash.
Custom error logs | Logs created by your application or service.
.NET EventSource | Events generated by your code using the .NET EventSource class.
Manifest-based ETW | ETW events generated by any process.
Azure SQL Database diagnostics
Health (master):
• sys.event_log
• sys.bandwidth_usage
• sys.database_connection_stats
Resource Usage:
• master.sys.resource_usage*
• master.sys.resource_stats*
• userdb.sys.dm_db_resource_stats
Data Access & Usage:
• sys.dm_db_index_usage_stats
• sys.dm_db_missing_index_details
• sys.dm_db_missing_index_groups
• sys.dm_db_missing_index_group_stats
• sys.dm_exec_sessions
Performance:
• sys.dm_exec_query_stats
• sys.dm_exec_sql_text
• sys.dm_exec_query_plan
• sys.dm_exec_requests
• sys.dm_db_wait_stats
Windows Azure SQL Database and SQL Server -- Performance and Scalability Compared and Contrastedhttp://msdn.microsoft.com/en-us/library/windowsazure/jj879332.aspx
Azure SQL Database diagnostic examples
Use database-level DMVs to identify top resource consumers:
• Snapshot current requests
• Order by elapsed time

DMV | Details | Use
sys.dm_exec_query_stats | Cumulative view of query statistics | Total and average resource consumption
sys.dm_exec_sql_text | Returns the text of the SQL batch identified by the specified sql_handle | Provides the overall batch text for a statement
sys.dm_exec_query_plan | Returns the plan in XML for the specified plan handle | Provides the plan for tuning and analysis
sys.dm_exec_requests | Current requests executing on your DB | Check for blocking, contention-related issues, convoys, etc.
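Joined together, the DMVs above yield a single top-consumers snapshot. A sketch of that query (Python is used here for consistency with the other samples; the `top_consumers` helper and the DB-API cursor it takes are assumptions — any SQL client works, only the T-SQL itself comes from the DMVs listed above):

```python
# Top resource consumers on a database, joining sys.dm_exec_query_stats
# with the text and plan DMVs and ordering by total elapsed time.
TOP_CONSUMERS_SQL = """
SELECT TOP (10)
    qs.total_elapsed_time / qs.execution_count AS avg_elapsed_time,
    qs.execution_count,
    qs.total_worker_time,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
          ELSE qs.statement_end_offset END - qs.statement_start_offset) / 2) + 1
    ) AS statement_text,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY qs.total_elapsed_time DESC;
"""

def top_consumers(cursor, sql=TOP_CONSUMERS_SQL):
    """Run the snapshot against any DB-API cursor and return the rows."""
    cursor.execute(sql)
    return cursor.fetchall()
```

The offsets are divided by 2 because `statement_start_offset`/`statement_end_offset` are byte offsets into nvarchar text; `-1` means "to the end of the batch".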
Traditional SQL Server debugging is a hands-on affair
Ad-hoc tracing
Querying DMVs
Performance counters
…
With 1 server and a few DBs, this is OK
With 5 servers, it starts to be painful
With thousands of servers, it is impossible
Limitations of Azure SQL DB DMVs
• Look at the Top Ns: CPU / IO / worker time / executions / averages
• Compare queries between shards: plan changes, resources, executions / hot shards? What is slow?
• Look at durations: DML, blocking / waits / throttling, one-offs
Mining data approach: N DBs
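The "N DBs" mining approach amounts to fanning the same DMV snapshot out to every shard and ranking the combined result, which is what makes hot shards and plan regressions visible. A sketch under assumed names (`fetch_stats` stands in for whatever client actually runs the DMV query against one shard):

```python
from concurrent.futures import ThreadPoolExecutor

def top_across_shards(shards, fetch_stats, top_n=10):
    """Run the same DMV snapshot on every shard in parallel, tag each row
    with its shard, and rank the combined result by worker time.
    fetch_stats(shard) returns dicts like {"query": ..., "worker_time": ...}.
    """
    def one(shard):
        # Tag rows with their origin so hot shards stand out after merging.
        return [dict(row, shard=shard) for row in fetch_stats(shard)]

    with ThreadPoolExecutor(max_workers=16) as pool:
        merged = [row for rows in pool.map(one, shards) for row in rows]

    return sorted(merged, key=lambda r: r["worker_time"], reverse=True)[:top_n]
```

Once every shard's rows are in one comparable list, the "compare queries between shards" questions above (plan changes, hot shards, what is slow) become ordinary sorts and group-bys.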
Telemetry: diagnostic data collection and analysis
This space is evolving rapidly
The choices of today will change again 12 months from now
Assume you will revisit the choices you make now
Invest in re-usable pieces, not monoliths
Scale of service usually determines the option
Not all options scale to the largest sizes
Buy vs. Build
Great for customers with an existing enterprise infrastructure
• Monitoring MVC and WCF applications
• Monitoring .NET apps hosted in Windows Services
• Monitoring applications running in IIS 8 / Windows Server 2012
• APM monitoring for SharePoint 2010
• Introducing Azure SDK support (Storage, SQL Azure)
• IntelliTrace integration
• TFS 2010 and 2012 support
System Center APM w/ OM12 SP1
Full PaaS offering
Free / $199 (per month per server)
Agent installation on the server (role instance)
Hooks the application via the Profiling API
Low friction to set up and use
New Relic
Works on-premises and in the cloud
Free → ~$2578.00/mo (10 x-large instances)
Agent-based, hooking the Profiling API
Great cross-instance correlation features
App Dynamics
Availability
Performance
Usage
Application Insights
Part of Visual Studio Online
Currently in preview
Based on Microsoft Monitoring Agent for server performance
Great integration with ALM
Typically these efforts evolve as follows:
1. Hook up something like SQL Azure or WA Tables to store data
2. Dump more and more stuff in
3. Queries get slower OR you run out of space (or both)
Once you hit this limit, things get interesting and you adopt Big Data approaches.
These work OK for reporting/data science, but poorly for alerts.
This leads to two systems: a “batch” pipeline and a “streaming” pipeline.
We will go through this evolution so you can see how to do each one.
“Build Your Own Telemetry” Experience
Distributing data
1. Perf counters (compute)
2. ETW events (compute)
3. Storage metrics (storage)
4. XEvents/errors (SQL)
5. DMVs (SQL)
6. Custom tracing (application-generated events)
Expect to iterate on this – as you run your service, you will find things you need and things that are not worth collecting, so you tune frequently.
What Data To Collect?
Let’s look at 3 example architectures:
1. Log to Table Storage + SCOM Azure Management Pack
2. Blob Storage + SQL Database
3. Big Data + Data Warehouse
Building Batch Pipelines
Approach 1 – Table Storage
• Log events from application code into WA Table Storage
• Manually query Table Storage to find data when there is a problem
• Put each kind of data (errors, perf counters) in separate tables
• Hook up to on-premises SCOM (or a similar tool) and run machines like you do on-premises
This model works fine at limited scales.
It is often the “first attempt” for telemetry systems, re-using on-premises capabilities for the first cloud deployments.
(Diagram: the application and its DBs send telemetry to SCOM)
SCOM Azure Management Pack: http://www.microsoft.com/en-us/download/details.aspx?id=11324
Generating Telemetry
• WA Table Storage: general maximum throughput is 1000 entities / partition / table
• Performance counters: WAD uses part of the timestamp as the partition key (which limits the number of concurrent entity writes)
• Each partition key is 60 seconds wide, and entities are written asynchronously in bulk
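The 60-second-wide partition key can be sketched as follows. This is an assumed simplification for illustration (the real WAD PartitionKey is a zero-padded tick count; the minute-bucket string and `row_key` helper here are hypothetical but preserve the property that matters — every sample in the same minute lands in the same partition, and keys sort in time order):

```python
import datetime

def partition_key(ts: datetime.datetime) -> str:
    """60-second-wide partition key: all samples within the same minute
    share one partition, capping the number of concurrent writers per
    partition (assumed scheme, mirroring how WAD buckets counters)."""
    bucket = ts.replace(second=0, microsecond=0)
    # Zero-prefixed, fixed-width key: lexicographic order == time order.
    return bucket.strftime("0%Y%m%d%H%M")

def row_key(instance: str, counter: str) -> str:
    # Role instance + counter name keeps rows unique inside the bucket.
    return f"{instance}__{counter}"
```

This is also why read performance degrades as you monitor more instances: every instance's counters for a given minute pile into the same partition.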
Consuming Telemetry
• WA Table Storage read performance degrades with the number of entities per partition
• Example: entities/partition = (# perf counter entries) * (# role instances being monitored)
Scaling the solution – you can extend this approach by:
• Collecting performance counters at a coarser grain (example: 1 minute → 5 minutes)
• Filtering more records (skip WARN/INFO messages, keep ERROR)
Problems
• Some PaaS services don’t expose performance counters (Azure SQL DB, Service Bus, etc.)
Approach 1 – Details/Limitations
Approach 2 – Blob Storage + SQL Database
• Polling of data-tier metrics added
• Trace analysis/aggregation done in a worker role
• Results stored in another DB (SQL Azure or SQL Server)
• Reports/dashboards built over the telemetry DB
Example Dashboard/Report System
CAT has developed a sample of Approach 2 called CSF (Cloud Service Fundamentals): a code sample package that demonstrates various guidelines for implementing scalable solutions on Windows Azure, including a specific section on data access.
Code: http://code.msdn.microsoft.com/Cloud-Service-Fundamentals-4ca72649
Documentation: http://social.technet.microsoft.com/wiki/contents/articles/17987.cloud-service-fundamentals.aspx
Cloud Service Fundamentals
Generating Telemetry
• WA Blob Storage supports higher limits (but you need to batch writes better)
• Polling DBs requires DMV diffing (which is imperfect but better than nothing)
• Multi-threading helps scale the system (to a point), but eventually you have latency
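DMV diffing works because views like sys.dm_exec_query_stats expose cumulative counters, so the poller keeps the previous snapshot and subtracts. A minimal sketch (the snapshot shape — keys identifying a query, values holding cumulative counters — is an assumption for illustration):

```python
def diff_snapshots(prev, curr):
    """Subtract the previous cumulative DMV snapshot from the current one.
    Keys identify a query (e.g. a query hash); values map counter names to
    cumulative totals. New keys are taken as-is; keys that vanished (plan
    evicted from cache) are silently dropped, which is one reason DMV
    diffing is imperfect but still better than nothing."""
    delta = {}
    for key, counters in curr.items():
        base = prev.get(key, {})
        d = {name: value - base.get(name, 0) for name, value in counters.items()}
        if any(v > 0 for v in d.values()):
            delta[key] = d  # only keep queries that did work this interval
    return delta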
Consuming Telemetry
• Database allows use of existing tools (Reporting Services, etc.)
• Writing Dashboards initially takes some time, but it can really help
Scaling the solution – you can extend this approach by:
• (Same as Approach 1 – collect less often or collect less data)
Problems
• Eventually you want data “faster” and things slow down as you scale your service
Approach 2 – Details/Limitations
Approach 3 - Petabyte Scale Telemetry
Azure SQL Database uses a model close to this:
• Processing 15+ TB/day
• 1 PB total scale
• The DW stores ~10 TB
• Processing runs 24/7/365 for over 1 million customer DBs
(Diagram: in each geo-region, alerting/compute deployments persist telemetry to WA Storage through partitioned queues; an HDInsight cluster runs scheduled MapReduce, Hive, and Pig jobs over the cluster data exhaust and persists the curated data back to WA Storage; on job-complete notification, an ETL process transforms/loads it into an on-premises data warehouse.)
Generating Telemetry
• On-node collectors batch telemetry and write it to multiple WA Blob Storage containers
• Per-geo-region accounts (collocated with service stamps in each region)
• Big Data (Hadoop or similar) system reads data across all stamps
• Aggregations/Trace Processing generate output data (to WA Blob Storage)
• ETL moves data into the DW
Consuming Telemetry
• Users query the DW’s star schema (facts/dimensions) using normal DB techniques
• Reports generated for common activities needed to run the business
• Queries using Hive against Hadoop also possible
Scaling the solution – you can extend this approach by:
• Adding more cores to Hadoop
• Buying a larger DW box
• Changing the aggregation grain for aggregation jobs
Problems
• E2E latency
• Layers between the Hadoop world and the Microsoft world (expertise in two technology stacks)
Approach 3 – Details/Limitations
Batch pipelines are great at doing things at scale.
But they are not fast – it often takes minutes to hours to process at scale.
Alerting for errors is all about speed (time-to-detect).
This leads to a different class of solution: “fast pipe” monitoring.
We measure every incident on how long it took to detect.
We file repair bugs to keep driving that metric lower next time.
You need to be selective about what passes through the fast pipe:
Perhaps you only look at key errors, or pre-aggregate values.
Otherwise you will overwhelm the alerting system.
Storage efficiency is also key – I see lots of denormalized row solutions.
(Near Real-Time) Alerting
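The pre-aggregation step just described can be sketched in a few lines: reduce a window of raw events to per-(severity, code) counts before they enter the fast pipe, so the alerting system sees a handful of rows instead of every record. The event shape (`level`, `code` fields) is an assumed one for illustration:

```python
from collections import Counter

def preaggregate(events, keep_levels=("ERROR",)):
    """Collapse a window of raw events to per-(level, code) counts.
    Severities outside keep_levels (e.g. WARN/INFO) are dropped before
    the fast pipe, which keeps the alerting system from being overwhelmed."""
    counts = Counter(
        (e["level"], e["code"]) for e in events if e["level"] in keep_levels
    )
    return [
        {"level": lvl, "code": code, "count": n}
        for (lvl, code), n in counts.items()
    ]
```

Run once per interval, this turns an arbitrarily large event stream into a bounded set of counter rows — the shape the spike-detection step downstream wants.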
Use Storm on Azure
• Recently introduced in Azure
• Complements the telemetry architectures we just described
After you have near real-time alerts, you can do Machine Learning Applications
• Auto-tuned alerts
• Prediction models (for failures based on historical behaviors, etc.)
• Watching multiple things for errors without defined alerts
We use ML algorithms to detect new bugs in WA SQL Database
• Watch all errors from all users (every minute or two)
• See if new kinds of errors start spiking
• Fire alerts for errors of appropriate severity
This is far better than:
• Firing alerts with static limits (these break as your service grows)
• Hand-coding each limit (takes a long time)
Machine Learning
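The error-spike detection described above can be sketched as a simple baseline-versus-current comparison. This is not the actual WA SQL Database algorithm — just an illustrative sketch of the idea, with all thresholds and data shapes assumed: flag codes whose current count is far above their recent per-interval baseline, and flag brand-new codes once they exceed a floor.

```python
from statistics import mean, stdev

def spiking_errors(history, current, min_sigma=3.0, min_count=10):
    """Flag error codes whose current-interval count is far above their
    recent baseline. `history` maps code -> list of per-interval counts;
    `current` maps code -> this interval's count. Codes with no history
    (brand-new kinds of errors) are flagged once they clear min_count,
    so new bugs surface without a hand-coded limit."""
    alerts = []
    for code, count in current.items():
        baseline = history.get(code, [])
        if len(baseline) < 2:
            if count >= min_count:       # new error type starting to spike
                alerts.append(code)
            continue
        mu, sigma = mean(baseline), stdev(baseline)
        threshold = mu + min_sigma * max(sigma, 1.0)  # floor avoids sigma == 0
        if count >= min_count and count > threshold:
            alerts.append(code)
    return alerts
```

Because the threshold tracks each code's own history, it grows with the service instead of breaking like a static limit does.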
Option 1: Go get R – it is free.
Then figure out how to pump lots of data through it, do alerts, etc.
Option 2: Try the Azure ML Service (not free, but easier to start)
Go author a job and try it out
Using Machine Learning
Resources
Failsafe: Guidance for Resilient Cloud Architectures (http://msdn.microsoft.com/en-us/library/jj853352.aspx)
Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services(http://msdn.microsoft.com/en-us/library/windowsazure/jj717232.aspx )
Designing and Deploying Internet Scale Serviceshttps://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf
Breakout Sessions
CDP-B215 Build, Deploy, Manage, and Monitor Your Cloud Applications Using the New Microsoft Azure Portal
CDP-B307 Telemetry and Data Flow at Hyper-Scale: Azure Event Hub
CDP-B343 Introduction to the NEW Microsoft Azure Operational Insights Service with System Center Operations Manager
DEV-B206 Application Insights Overview: How to Keep Your Applications Available, Performing, and Succeeding
Related content
Resources
Learning
Microsoft Certification & Training Resources
www.microsoft.com/learning
Developer Network
http://developer.microsoft.com
TechNet
Resources for IT Professionals
http://microsoft.com/technet
Sessions on Demand
http://channel9.msdn.com/Events/TechEd
Come visit us in the Microsoft Solutions Experience (MSE)!
Look for the Cloud and Datacenter Platform area, TechExpo Hall 7
For more information:
Windows Server Technical Preview: http://technet.microsoft.com/library/dn765472.aspx
Microsoft Azure: http://azure.microsoft.com/en-us/
System Center Technical Preview: http://technet.microsoft.com/en-us/library/hh546785.aspx
Azure Pack: http://www.microsoft.com/en-us/server-cloud/products/windows-azure-pack
Azure certification and training
Classroom training (MOC):
• MOC 10979 Microsoft Azure Fundamentals (coming soon)
• MOC 20532 Developing Microsoft Azure Solutions
• MOC 20533 Implementing Microsoft Azure Infrastructure Solutions
Exams:
• Exam 532 Developing Microsoft Azure Solutions
• Exam 533 Implementing Microsoft Azure Infrastructure Solutions
• Exam 534 Architecting Microsoft Azure Solutions (coming soon)
Online training (MVA): http://bit.ly/Azure-MVA
Certification: http://bit.ly/Azure-Cert
Training: http://bit.ly/Azure-Train
Get certified for 1/2 the price at TechEd Europe 2014! http://bit.ly/TechEd-CertDeal
Please complete an evaluation form – your input is important!
• TechEd Schedule Builder: CommNet station or PC
• TechEd Mobile app: phone or tablet
• QR code
Evaluate this session
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.