AzureCAT: Design cloud-based solutions for operations, monitoring and troubleshooting
Silvano Coriani, Program Manager, AzureCAT, Microsoft
CDP-B309
Distributed systems monitoring and diagnostics
Understand application health and behaviour
Instrumentation sources
Application telemetry
Diagnostic data collection and analysis
Practical Customer Examples
Agenda
Traditional Troubleshooting
1. Customer complains the app is broken
2. Sys admins / DBAs log onto the machine
3. Manually look at perf counters, DMVs, etc.
4. Find the problem and manually fix it
This model assumes:
• You can wait for the expert to look at the issue
• You don’t have too many issues to examine at once
• The system can be understood by a human (not too complicated)
Most of these assumptions prove false in the cloud
Monitoring and diagnostic experience
Operational scenarios
Scenario | Description/Example | Time Horizon | Data Size
Alerting | Detecting and mitigating problems | Now | Small to Large
Dashboards | Service insight | Now-Recent | Modest
Reports | How is feature X adoption progressing day over day? | Hourly/Daily | Medium
Data Science | Building prediction models based on past behaviors | Unlimited | Very Large
Complex cloud architecture example...
100s of compute nodes, 1000s of Azure SQL DBs
IaaS instances and other services
Real customer examples
Large social experience site Sochi Winter Olympics…
Cloud Customers may just quit your service if it is broken – and you would never know without the right monitoring
Cloud Troubleshooting Model
Cloud apps have key differences from traditional on-premises systems:
• Internet-facing, always up
• Service SLAs – uptime requirements
• Larger scale – ISVs/SaaS vendors host all customers vs. sell/deploy each customer 1-by-1
Troubleshooting in the Cloud
• Too many machines/databases/etc. to troubleshoot manually
• Separate “mitigate” vs. “root cause” (RCA) determination
• Generate telemetry to determine RCA (later)
• Find a way to get things working ASAP (reboot/failover/whatever)
How do I find and fix problems in the cloud service before my customers are affected?
…all the time?
…with potentially millions of customers?
…without spending infinite money on telemetry?
…with a specific focus on:
• Availability
• Performance
• Usage
Rephrased Problem Statement
Understand application health and behavior
Fundamentals of Telemetry & Insight
• Analyze: at a certain size, tools to analyze and monitor the system work
• System for the system: beyond that, you need a system to monitor the system
Insight Perspectives
• users are active
• trends
Characterizing Insight
Instrumentation sources (single and multiple instances)
Characterizing Signals
Sources of Signals
• Manual tracing
Your application is the best source of information if you log all internal/external “transactions” (DBs, web services, etc.) with:
1. Application context (module/component)
2. Host context (server/role/instance/process)
3. Timing information (start/stop/duration)
4. Activity identifier
Correlated insights from SQL Database and other services
Consolidate logs into a central system / dashboard for health monitoring and troubleshooting
Effective instrumentation requires careful planning
Capturing insights at scale
Capturing consistent application level insights
Capture timing and context information through helper delegates (background noise)
Capture contextual errors (inner exceptions, etc.) on error
Logging library is asynchronous (fire-and-forget) to avoid blocking
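The pattern above can be sketched in a few lines. This is an illustrative Python sketch (the deck's actual samples use .NET and SLAB; all names here are hypothetical): a helper wraps each external call, captures application context, host context, timing, and an activity identifier, records inner errors, and hands the record to a background thread so the caller never blocks on logging.

```python
import queue
import socket
import threading
import time
import uuid

log_queue = queue.Queue()

def _writer():
    # Background consumer: drains records and ships them to a sink.
    while True:
        record = log_queue.get()
        print(record)  # stand-in for a real sink (table, blob, ETW, ...)
        log_queue.task_done()

threading.Thread(target=_writer, daemon=True).start()

def tracked(component, operation, func, *args, **kwargs):
    """Run func, capturing context, timing and errors; log fire-and-forget."""
    record = {
        "activity_id": str(uuid.uuid4()),   # correlates related events
        "component": component,             # application context
        "host": socket.gethostname(),       # host context
        "operation": operation,
        "start": time.time(),
    }
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        record["error"] = repr(exc)         # contextual error detail
        raise
    finally:
        record["duration_ms"] = (time.time() - record["start"]) * 1000
        log_queue.put(record)               # enqueue; caller never blocks
```

Note the record is enqueued even on the error path, so failed calls still produce telemetry with their duration and exception attached.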
Logging technologies at play
Event Tracing for Windows
(ETW)
• Native to the Windows platform
• Great performance & OK diagnostic tooling
• Historically hard to publish events
EventSource class
• New in .NET Framework 4.5
• Meant to ease the authoring experience
• Extensible, but supports ETW only out of the box
Semantic Logging
Application Block (SLAB)
• Provides several destinations for events published with EventSource
• Does not require any knowledge of ETW
• Additional tooling support for authoring events
.NET Event Source
Custom Event Source
SLAB in-process: ObservableEventListener → Sinks
SLAB out-of-process: ETW → TraceEvent → Sinks
Third-party tools (e.g. PerfView)
Event Log
Technologies at play
Several sinks/destinations:
• Windows Azure Table
• SQL Database
• Flat file
• Rolling flat file
• Console
Formatters (for text-based sinks):
• JSON
• XML
• Natural (plain text)
SLAB features – sinks
Microsoft Azure Diagnostics (1.0)
• Perf Counters → WAD Performance Counters Table
• Windows Events → WAD Windows Events Logs Table
• Diag Events → WAD Logs Table
Azure Diagnostics 1.2 for PaaS and IaaS
Data Source | Description
IIS Logs | Information about IIS web sites.
Azure Diagnostic infrastructure logs | Information about Diagnostics itself.
IIS Failed Request logs | Information about failed requests to an IIS site or application.
Windows Event logs | Information sent to the Windows event logging system.
Performance counters | Operating system and custom performance counters.
Crash dumps | Information about the state of the process in the event of an application crash.
Custom error logs | Logs created by your application or service.
.NET EventSource | Events generated by your code using the .NET EventSource class.
Manifest-based ETW | ETW events generated by any process.
Azure SQL Database diagnostics
Health (master):
• sys.event_log
• sys.bandwidth_usage
• sys.database_connection_stats
Resource Usage:
• master.sys.resource_usage*
• master.sys.resource_stats*
• userdb.sys.dm_db_resource_stats
Data Access & Usage:
• sys.dm_db_index_usage_stats
• sys.dm_db_missing_index_details
• sys.dm_db_missing_index_groups
• sys.dm_db_missing_index_group_stats
• sys.dm_exec_sessions
Performance:
• sys.dm_exec_query_stats
• sys.dm_exec_sql_text
• sys.dm_exec_query_plan
• sys.dm_exec_requests
• sys.dm_db_wait_stats
Windows Azure SQL Database and SQL Server -- Performance and Scalability Compared and Contrastedhttp://msdn.microsoft.com/en-us/library/windowsazure/jj879332.aspx
Azure SQL Database diagnostic examples
Use database-level DMVs to identify top resource consumers:
• Snapshot current requests
• Order by elapsed time

DMV | Details | Use
sys.dm_exec_query_stats | Cumulative view of query statistics | Total and average resource consumption
sys.dm_exec_sql_text | Returns the text of the SQL batch identified by the specified sql_handle | Provides the overall batch text for a statement
sys.dm_exec_query_plan | Returns the plan in XML for the specified plan handle | Provides the plan for tuning and analysis
sys.dm_exec_requests | Current requests executing on your DB | Check for blocking, contention-related issues, convoys, etc.
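Joined together, the DMVs above yield a single top-consumers snapshot. A sketch of that query (Python is used here for consistency with the other samples; the `top_consumers` helper and the DB-API cursor it takes are assumptions — any SQL client works, only the T-SQL itself comes from the DMVs listed above):

```python
# Top resource consumers on a database, joining sys.dm_exec_query_stats
# with the text and plan DMVs and ordering by total elapsed time.
TOP_CONSUMERS_SQL = """
SELECT TOP (10)
    qs.total_elapsed_time / qs.execution_count AS avg_elapsed_time,
    qs.execution_count,
    qs.total_worker_time,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
          ELSE qs.statement_end_offset END - qs.statement_start_offset) / 2) + 1
    ) AS statement_text,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY qs.total_elapsed_time DESC;
"""

def top_consumers(cursor, sql=TOP_CONSUMERS_SQL):
    """Run the snapshot against any DB-API cursor and return the rows."""
    cursor.execute(sql)
    return cursor.fetchall()
```

The offsets are divided by 2 because `statement_start_offset`/`statement_end_offset` are byte offsets into nvarchar text; `-1` means "to the end of the batch".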
Traditional SQL Server debugging is a hands-on affair
Ad-hoc tracing
Querying DMVs
Performance counters
…
With 1 server and a few DBs, this is OK
With 5 servers, it starts to be painful
With thousands of servers, it is impossible
Limitations of Azure SQL DB DMVs
• Look at the Top Ns: CPU / IO / worker time / executions / averages
• Compare queries between shards: plan changes, resources, executions / hot shards? What is slow?
• Look at durations: DML, blocking / waits / throttling, one-offs
Mining data approach: N DBs
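The "N DBs" mining approach amounts to fanning the same DMV snapshot out to every shard and ranking the combined result, which is what makes hot shards and plan regressions visible. A sketch under assumed names (`fetch_stats` stands in for whatever client actually runs the DMV query against one shard):

```python
from concurrent.futures import ThreadPoolExecutor

def top_across_shards(shards, fetch_stats, top_n=10):
    """Run the same DMV snapshot on every shard in parallel, tag each row
    with its shard, and rank the combined result by worker time.
    fetch_stats(shard) returns dicts like {"query": ..., "worker_time": ...}.
    """
    def one(shard):
        # Tag rows with their origin so hot shards stand out after merging.
        return [dict(row, shard=shard) for row in fetch_stats(shard)]

    with ThreadPoolExecutor(max_workers=16) as pool:
        merged = [row for rows in pool.map(one, shards) for row in rows]

    return sorted(merged, key=lambda r: r["worker_time"], reverse=True)[:top_n]
```

Once every shard's rows are in one comparable list, the "compare queries between shards" questions above (plan changes, hot shards, what is slow) become ordinary sorts and group-bys.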
Telemetry: diagnostic data collection and analysis
This space is evolving rapidly
The choices of today will change again 12 months from now
Assume you will revisit the choices you make now
Invest in re-usable pieces, not monoliths
Scale of service usually determines the option
Not all options scale to the largest sizes
Buy vs. Build
Great for customers with an existing enterprise infrastructure
• Monitoring MVC and WCF applications
• Monitoring .NET apps hosted in Windows Services
• Monitoring applications running in IIS 8 / Windows Server 2012
• APM monitoring for SharePoint 2010
• Introducing Azure SDK support (Storage, SQL Azure)
• IntelliTrace integration
• TFS 2010 and 2012 support
System Center APM w/ OM12 SP1
Full PaaS offering
Free / $199 (per month per server)
Agent installation on the server (role instance)
Hooks the application via the Profiling API
Low friction to set up and use
New Relic
Works on-premises and in the cloud
Free → ~$2578.00/mo (10 x-large instances)
Agent-based, hooking the Profiling API
Great cross-instance correlation features
App Dynamics
Availability
Performance
Usage
Application Insights
Part of Visual Studio Online
Currently in preview
Based on Microsoft Monitoring Agent for server performance
Great integration with ALM
Typically these efforts evolve as follows:
1. Hook up something like SQL Azure or WA Tables to store data
2. Dump more and more stuff in
3. Queries get slower OR you run out of space (or both)
Once you hit this limit, things get interesting and you adopt Big Data approaches.
These work OK for reporting/data science, but poorly for alerts.
This leads to two systems: a “batch” pipeline and a “streaming” pipeline.
We will go through this evolution so you can see how to do each one.
“Build Your Own Telemetry” Experience
Distributing data
1. Perf counters (compute)
2. ETW events (compute)
3. Storage metrics (storage)
4. XEvents/errors (SQL)
5. DMVs (SQL)
6. Custom tracing (application-generated events)
Expect to iterate on this – as you run your service, you will find things you need and things that are not worth collecting, so you tune frequently.
What Data To Collect?
Let’s look at 3 example architectures:
1. Log to Table Storage + SCOM Azure Management Pack
2. Blob Storage + SQL Database
3. Big Data + Data Warehouse
Building Batch Pipelines
Approach 1 – Table Storage
• Log events from application code into WA Table Storage
• Manually query Table Storage to find data when there is a problem
• Put each kind of data (errors, perf counters) in separate tables
• Hook up to on-premises SCOM (or a similar tool) and run machines like you do on-premises
This model works fine at limited scales.
It is often the “first attempt” for telemetry systems, re-using on-premises capabilities for the first cloud deployments.
(Diagram: the application and its DBs send telemetry to SCOM)
SCOM Azure Management Pack: http://www.microsoft.com/en-us/download/details.aspx?id=11324
Generating Telemetry
• WA Table Storage: general maximum throughput is 1000 entities / partition / table
• Performance counters: WAD uses part of the timestamp as the partition key (which limits the number of concurrent entity writes)
• Each partition key is 60 seconds wide, and entities are written asynchronously in bulk
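The 60-second-wide partition key can be sketched as follows. This is an assumed simplification for illustration (the real WAD PartitionKey is a zero-padded tick count; the minute-bucket string and `row_key` helper here are hypothetical but preserve the property that matters — every sample in the same minute lands in the same partition, and keys sort in time order):

```python
import datetime

def partition_key(ts: datetime.datetime) -> str:
    """60-second-wide partition key: all samples within the same minute
    share one partition, capping the number of concurrent writers per
    partition (assumed scheme, mirroring how WAD buckets counters)."""
    bucket = ts.replace(second=0, microsecond=0)
    # Zero-prefixed, fixed-width key: lexicographic order == time order.
    return bucket.strftime("0%Y%m%d%H%M")

def row_key(instance: str, counter: str) -> str:
    # Role instance + counter name keeps rows unique inside the bucket.
    return f"{instance}__{counter}"
```

This is also why read performance degrades as you monitor more instances: every instance's counters for a given minute pile into the same partition.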
Consuming Telemetry
• WA Table Storage read performance degrades with the number of entities per partition
• Example: entities/partition = (# perf counter entries) * (# role instances being monitored)
Scaling the solution – you can extend this approach by:
• Collecting performance counters at a coarser grain (example: 1 minute → 5 minutes)
• Filtering more records (skip WARN/INFO messages, keep ERROR)
Problems
• Some PaaS services don’t expose performance counters (Azure SQL DB, Service Bus, etc.)
Approach 1 – Details/Limitations
Approach 2 – Blob Storage + SQL Database
• Polling of data-tier metrics added
• Trace analysis/aggregation done in a worker role
• Results stored in another DB (SQL Azure or SQL Server)
• Reports/dashboards built over the telemetry DB
Example Dashboard/Report System
CAT has developed a sample of Approach 2 called CSF (Cloud Service Fundamentals): a code sample package that demonstrates various guidelines for implementing scalable solutions on Windows Azure, including a specific section on data access.
Code: http://code.msdn.microsoft.com/Cloud-Service-Fundamentals-4ca72649
Documentation: http://social.technet.microsoft.com/wiki/contents/articles/17987.cloud-service-fundamentals.aspx
Cloud Service Fundamentals
Generating Telemetry
• WA Blob Storage supports higher limits (but you need to batch writes better)
• Polling DBs requires DMV diffing (which is imperfect but better than nothing)
• Multi-threading helps scale the system (to a point), but eventually you have latency
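DMV diffing works because views like sys.dm_exec_query_stats expose cumulative counters, so the poller keeps the previous snapshot and subtracts. A minimal sketch (the snapshot shape — keys identifying a query, values holding cumulative counters — is an assumption for illustration):

```python
def diff_snapshots(prev, curr):
    """Subtract the previous cumulative DMV snapshot from the current one.
    Keys identify a query (e.g. a query hash); values map counter names to
    cumulative totals. New keys are taken as-is; keys that vanished (plan
    evicted from cache) are silently dropped, which is one reason DMV
    diffing is imperfect but still better than nothing."""
    delta = {}
    for key, counters in curr.items():
        base = prev.get(key, {})
        d = {name: value - base.get(name, 0) for name, value in counters.items()}
        if any(v > 0 for v in d.values()):
            delta[key] = d  # only keep queries that did work this interval
    return delta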
Consuming Telemetry
• Database allows use of existing tools (Reporting Services, etc.)
• Writing Dashboards initially takes some time, but it can really help
Scaling the solution – you can extend this approach by:
• (Same as Approach 1 – collect less often or collect less data)
Problems
• Eventually you want data “faster” and things slow down as you scale your service
Approach 2 – Details/Limitations
Approach 3 - Petabyte Scale Telemetry
Azure SQL Database uses a model close to this:
• Processing 15+ TB/day
• 1 PB total scale
• The DW stores ~10 TB
• Processing runs 24/7/365 for over 1 million customer DBs
(Diagram: in each geo-region, alerting/compute deployments persist telemetry to WA Storage through partitioned queues; an HDInsight cluster runs scheduled MapReduce, Hive, and Pig jobs over the cluster data exhaust and persists the curated data back to WA Storage; on job-complete notification, an ETL process transforms/loads it into an on-premises data warehouse.)
Generating Telemetry
• On-node collectors batch telemetry and write it to multiple WA Blob Storage containers
• Per-geo-region accounts (collocated with service stamps in each region)
• Big Data (Hadoop or similar) system reads data across all stamps
• Aggregations/Trace Processing generate output data (to WA Blob Storage)
• ETL moves data into the DW
Consuming Telemetry
• Users query the DW’s star schema (facts/dimensions) using normal DB techniques
• Reports generated for common activities needed to run the business
• Queries using Hive against Hadoop also possible
Scaling the solution – you can extend this approach by:
• Adding more cores to Hadoop
• Buying a larger DW box
• Changing the aggregation grain for aggregation jobs
Problems
• E2E latency
• Layers between the Hadoop world and the Microsoft world (expertise in two technology stacks)
Approach 3 – Details/Limitations
Batch pipelines are great at doing things at scale.
But they are not fast – it often takes minutes to hours to process at scale.
Alerting for errors is all about speed (time-to-detect).
This leads to a different class of solution: “fast pipe” monitoring.
We measure every incident on how long it took to detect.
We file repair bugs to keep driving that metric lower next time.
You need to be selective about what passes through the fast pipe:
Perhaps you only look at key errors, or pre-aggregate values.
Otherwise you will overwhelm the alerting system.
Storage efficiency is also key – I see lots of denormalized row solutions.
(Near Real-Time) Alerting
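The pre-aggregation step just described can be sketched in a few lines: reduce a window of raw events to per-(severity, code) counts before they enter the fast pipe, so the alerting system sees a handful of rows instead of every record. The event shape (`level`, `code` fields) is an assumed one for illustration:

```python
from collections import Counter

def preaggregate(events, keep_levels=("ERROR",)):
    """Collapse a window of raw events to per-(level, code) counts.
    Severities outside keep_levels (e.g. WARN/INFO) are dropped before
    the fast pipe, which keeps the alerting system from being overwhelmed."""
    counts = Counter(
        (e["level"], e["code"]) for e in events if e["level"] in keep_levels
    )
    return [
        {"level": lvl, "code": code, "count": n}
        for (lvl, code), n in counts.items()
    ]
```

Run once per interval, this turns an arbitrarily large event stream into a bounded set of counter rows — the shape the spike-detection step downstream wants.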
Use Storm on Azure
• Recently introduced in Azure
• Complements the telemetry architectures we just described
After you have near real-time alerts, you can do Machine Learning Applications
• Auto-tuned alerts
• Prediction models (for failures based on historical behaviors, etc.)
• Watching multiple things for errors without defined alerts
We use ML algorithms to detect new bugs in WA SQL Database
• Watch all errors from all users (every minute or two)
• See if new kinds of errors start spiking
• Fire alerts for errors of appropriate severity
This is far better than:
• Firing alerts with static limits (these break as your service grows)
• Hand-coding each limit (takes a long time)
Machine Learning
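The error-spike detection described above can be sketched as a simple baseline-versus-current comparison. This is not the actual WA SQL Database algorithm — just an illustrative sketch of the idea, with all thresholds and data shapes assumed: flag codes whose current count is far above their recent per-interval baseline, and flag brand-new codes once they exceed a floor.

```python
from statistics import mean, stdev

def spiking_errors(history, current, min_sigma=3.0, min_count=10):
    """Flag error codes whose current-interval count is far above their
    recent baseline. `history` maps code -> list of per-interval counts;
    `current` maps code -> this interval's count. Codes with no history
    (brand-new kinds of errors) are flagged once they clear min_count,
    so new bugs surface without a hand-coded limit."""
    alerts = []
    for code, count in current.items():
        baseline = history.get(code, [])
        if len(baseline) < 2:
            if count >= min_count:       # new error type starting to spike
                alerts.append(code)
            continue
        mu, sigma = mean(baseline), stdev(baseline)
        threshold = mu + min_sigma * max(sigma, 1.0)  # floor avoids sigma == 0
        if count >= min_count and count > threshold:
            alerts.append(code)
    return alerts
```

Because the threshold tracks each code's own history, it grows with the service instead of breaking like a static limit does.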
Option 1: Go get R – it is free.
Then figure out how to pump lots of data through it, do alerts, etc.
Option 2: Try the Azure ML Service (not free, but easier to start)
Go author a job and try it out
Using Machine Learning
Resources
Failsafe: Guidance for Resilient Cloud Architectures (http://msdn.microsoft.com/en-us/library/jj853352.aspx)
Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services(http://msdn.microsoft.com/en-us/library/windowsazure/jj717232.aspx )
Designing and Deploying Internet Scale Serviceshttps://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf
Breakout Sessions
CDP-B215 Build, Deploy, Manage, and Monitor Your Cloud Applications Using the New Microsoft Azure Portal
CDP-B307 Telemetry and Data Flow at Hyper-Scale: Azure Event Hub
CDP-B343 Introduction to the NEW Microsoft Azure Operational Insights Service with System Center Operations Manager
DEV-B206 Application Insights Overview: How to Keep Your Applications Available, Performing, and Succeeding
Related content
Resources
Learning
Microsoft Certification & Training Resources
www.microsoft.com/learning
Developer Network
http://developer.microsoft.com
TechNet
Resources for IT Professionals
http://microsoft.com/technet
Sessions on Demand
http://channel9.msdn.com/Events/TechEd
Come visit us in the Microsoft Solutions Experience (MSE)!
Look for the Cloud and Datacenter Platform area, TechExpo Hall 7
For more information:
Windows Server Technical Preview: http://technet.microsoft.com/library/dn765472.aspx
Microsoft Azure: http://azure.microsoft.com/en-us/
System Center Technical Preview: http://technet.microsoft.com/en-us/library/hh546785.aspx
Azure Pack: http://www.microsoft.com/en-us/server-cloud/products/windows-azure-pack
Azure certification and training
Classroom training (MOC):
• MOC 10979 Microsoft Azure Fundamentals (coming soon)
• MOC 20532 Developing Microsoft Azure Solutions
• MOC 20533 Implementing Microsoft Azure Infrastructure Solutions
Exams:
• Exam 532 Developing Microsoft Azure Solutions
• Exam 533 Implementing Microsoft Azure Infrastructure Solutions
• Exam 534 Architecting Microsoft Azure Solutions (coming soon)
Online training (MVA): http://bit.ly/Azure-MVA
Certification: http://bit.ly/Azure-Cert
Training: http://bit.ly/Azure-Train
Get certified for 1/2 the price at TechEd Europe 2014! http://bit.ly/TechEd-CertDeal
Please complete an evaluation form – your input is important!
• TechEd Schedule Builder: CommNet station or PC
• TechEd Mobile app: phone or tablet
• QR code
Evaluate this session
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.