Mastering Performance Monitoring and Capacity Planning using vRealize Operations Manager Reghuram Vasanthakumari, Staff Engineer, VMware Mohit Kataria, Product Owner, VMware

Page 1: Mastering Performance Monitoring and Capacity Planning using

Mastering Performance Monitoring and Capacity Planning using vRealize Operations Manager

Reghuram Vasanthakumari, Staff Engineer, VMware Mohit Kataria, Product Owner, VMware

Page 2: Mastering Performance Monitoring and Capacity Planning using

Disclaimer

• This presentation may contain product features that are currently under development

• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product

• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind

• Technical feasibility and market demand will affect final delivery

• Pricing and packaging for any new technologies or features discussed or presented have not been determined

2

Page 3: Mastering Performance Monitoring and Capacity Planning using

Agenda

1 Introduction to vRealize Operations Suite

2 Operations Management Goals

3 Real World Troubleshooting Scenarios

4 Q&A

3

Page 4: Mastering Performance Monitoring and Capacity Planning using

4

Page 5: Mastering Performance Monitoring and Capacity Planning using

Today’s Reality in Operations Management

Monitoring Data Overload

Alert Storms

Finger Pointing

DBA

VI Storage

Over-provisioning

5

Page 6: Mastering Performance Monitoring and Capacity Planning using

Volume of Monitoring Data is Exploding

6

[Chart: Metrics & Data Volume and Alert Volume across three stages]

Traditional Stack (Server, Storage, Networking, Web, App Server and DB)

Virtualized Infrastructure (incl. Storage and Network Virtualization)

Distributed & Mobile Apps (incl. Public cloud, SaaS, Mash-ups, …)

“Operations Gap”

“Commit to a comprehensive IT Operations Analytics strategy to optimize today's operations and support future I&O work” – Gartner

Page 7: Mastering Performance Monitoring and Capacity Planning using

Evolution of Operations Analytics Technology

7

[Chart axes: Reactive to Proactive, Manual to Automated]

Traditional Monitoring (Hyperic, SCOM, Nagios, …)

• Data collection (Metrics, logs, …)

• Static thresholds

• Alerts

Event Correlation (BMC, HP, CA, IBM, …)

• Data collection

• Aggregation

• Masking & filtering

• Rules-based alert suppression

Performance Analytics (VR Ops 1.0-5.x, Netuitive, …)

• Data collection

• Self-learning

• Dynamic thresholds

• Super metrics

Predictive Analytics (vRealize Operations 6.0)

• Data collection

• Detect complex issues from multiple symptoms

• Remediation and automation engine

• Scale-out, data-agnostic platform

10x Alert Reduction

Page 8: Mastering Performance Monitoring and Capacity Planning using

VMware’s Approach to Operations Analytics

8

Operations Analytics & Automation

Performance & Availability

Logs & Unstructured Data

Topology Analysis

Configuration Health

Capacity Planning

Page 9: Mastering Performance Monitoring and Capacity Planning using

vRealize Operations Overview

vRealize Operations

Operations Console

Integrated Management Disciplines

Performance, Compliance, Configuration, Capacity, Availability

Resilient, Scale-Out Platform

App Visibility, Logs*, Analytics, Reporting/Alerting, Automation

Extensibility

SDK, Management Packs, APIs

Quality of Service

Operational Efficiency

Control and Compliance

9

*vRealize Log Insight is not part of vRealize Operations but included with vRealize Operations Insight and vRealize Suite

Page 10: Mastering Performance Monitoring and Capacity Planning using

Agenda

1 Introduction to vRealize Operations Suite

2 Operations Management Goals

3 Real World Troubleshooting Scenarios

4 Q&A

10

Page 11: Mastering Performance Monitoring and Capacity Planning using

Status Quo Goal

• Are you able to meet or exceed service level expectations?

• Can you remediate issues before end users are impacted?

• How many monitoring tools are you using?

Quality of

Service

• What is your average Mean Time to Incident & Resolution?

• Do you manage your infrastructure capacity?

• How do you plan for future needs?

Operational

Efficiency

• Is your IT infrastructure compliant to regulatory standards?

• Can you proactively enforce IT standards in your organization?

Control

and

Compliance

Operations Management Goals

11

Page 12: Mastering Performance Monitoring and Capacity Planning using

Status Quo

• Are you able to meet or exceed service level agreements?

• Can you remediate issues before end users are impacted?

• How many monitoring tools are you using?

• What is your average Mean Time to Remediate (MTTR)?

• Do you leverage automated capacity optimization to improve

resource utilization?

• Are you able to accurately forecast your future capacity needs?

• Is your IT infrastructure compliant to regulatory standards?

• Do you have the capability to create flexible groups and policies

for different resource types and teams?

• Can you proactively enforce IT standards in your organization?

Goal

What Operations Management Teams are Looking For?

Quality of

Service

Operational

Efficiency

Control

and

Compliance

12

Page 13: Mastering Performance Monitoring and Capacity Planning using

How VMware Helps in Delivering Quality of Service

Improve performance and avoid disruption with self-learning management tools

Quality of Service

Key Capabilities

Self-learning predictive analytics

Smart alerts identify problems based on multiple symptoms

Domain-specific management packs for MS, SAP, NSX etc.

Benefits

90% reduction in alert volume

Proactively detect & avoid incidents early-on

No new monitoring tools or point products needed

Page 14: Mastering Performance Monitoring and Capacity Planning using

Evolution of Traditional Monitoring towards Operational Analytics

Traditional Monitoring

• Static Thresholds

• Symptom Focused

• 100s of Alerts

• Alert Storms

Predictive Analytics

• Dynamic Thresholds

• Problem Based

• 10x Alert Reduction

• Smart Alert 1, Smart Alert 2, Smart Alert 3, Smart Alert 4

Problem Based Alerts combine multiple symptoms
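The idea of a problem-based alert that fires only when several symptoms coincide can be sketched as follows. This is a minimal illustration, not how vRealize Operations defines alerts internally; the symptom and alert names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertDefinition:
    name: str
    required_symptoms: frozenset  # every symptom here must be active


def evaluate(definitions, active_symptoms):
    """Return the names of alerts whose symptoms are all active."""
    active = set(active_symptoms)
    return [d.name for d in definitions if d.required_symptoms <= active]


# Hypothetical symptom and alert names, for illustration only.
DEFINITIONS = [
    AlertDefinition(
        "VM CPU starved by busy host",
        frozenset({"vm_cpu_contention_high", "host_cpu_demand_high"}),
    ),
    AlertDefinition(
        "Datastore latency impacting VM",
        frozenset({"vm_disk_latency_high", "datastore_latency_high"}),
    ),
]

fired = evaluate(
    DEFINITIONS,
    ["vm_cpu_contention_high", "host_cpu_demand_high", "vm_disk_latency_high"],
)
print(fired)
```

Three active symptoms produce one alert here instead of three, which is the alert-reduction effect the slide describes.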

Page 15: Mastering Performance Monitoring and Capacity Planning using

How is VMware Self-learning Analytics Different?

Dynamic Thresholds

Dynamic Thresholds adapt to workload changes and eliminate alert storms and false positives

Super Metrics

Super Metrics combine hundreds of KPIs into health, risk and efficiency scores

Predictive Analytics

Problem Detection from multiple symptoms drives recommendation and proactive action

Health: Immediate Issues

Risk: Future Issues

Efficiency: Optimization Opportunities
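To make the contrast between static and dynamic thresholds concrete, here is a minimal sketch that learns a normal band from history as mean plus or minus k standard deviations. This is a simple stand-in technique chosen for illustration; it is not the algorithm vRealize Operations uses, and the function names and sample data are assumptions.

```python
from statistics import mean, stdev


def dynamic_band(history, k=3.0):
    """Return (lower, upper) bounds learned from past samples."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, history, k=3.0):
    """True when the value falls outside the learned band."""
    lower, upper = dynamic_band(history, k)
    return value < lower or value > upper


# Hypothetical CPU % samples from a workload that normally runs hot.
# A static 80% threshold would alert on every one of them.
busy_vm_history = [85, 88, 90, 87, 86, 89, 91, 88]

print(is_anomalous(89, busy_vm_history))  # inside the learned band
print(is_anomalous(40, busy_vm_history))  # unusually low for this workload
```

The second call shows a side effect of learning normal behavior: a sudden drop is flagged as well, which a static upper threshold would never catch.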

Page 16: Mastering Performance Monitoring and Capacity Planning using

Applying Analytics to the Past, Present and Future Infrastructure and Application Behavior

Learned Behavior Expected Demand Real-time Events

< >

Historical Data Planned Projects Predicted Behavior

Automate Workflows

Improve Analytics & Avoid Risk

Identify Stress & Improve Efficiency

Page 17: Mastering Performance Monitoring and Capacity Planning using

vRealize

Operations

Adopting an Analytics Based Process

17

1. Identify key metrics to measure – do not focus on the UI!

2. Start with vSphere and gradually broaden scope

3. Build a library of best practices and repeatable workflows

4. Incent team to focus on issue prevention

5. Share your success with other teams

5 Steps to an Analytics Based Process

Page 18: Mastering Performance Monitoring and Capacity Planning using

Health Alert – “Performance” Troubleshooting

18

Performance alert contributing to

degraded health. Let’s click to

see details …

Page 19: Mastering Performance Monitoring and Capacity Planning using

Smart Alerts deliver Insight and Information

19

Correlate symptoms across

the stack

Page 20: Mastering Performance Monitoring and Capacity Planning using

Customize Alerts to Your Needs

20

Add remediation actions from

vCenter, vRealize Orchestrator

or Python scripts

Combine Analytics with

symptoms and recommendations

Page 21: Mastering Performance Monitoring and Capacity Planning using

Status Quo

• Are you able to meet or exceed service level agreements?

• Do you use point products to manage your IT infrastructure?

• Can you remediate issues before end users are impacted?

• What is your average Mean Time to Incident & Resolution?

• Do you manage your infrastructure capacity?

• How do you plan for future needs?

• Is your IT infrastructure compliant to regulatory standards?

• Do you have the capability to create flexible groups and policies

for different resource types and teams?

• Can you proactively enforce IT standards in your organization?

Goal

What Operations Management Teams are Looking For?

Quality of

Service

Operational

Efficiency

Control

and

Compliance

21

Page 22: Mastering Performance Monitoring and Capacity Planning using

Performance

Higher utilization

Ignore Waste

Higher density

safe

Production

Test-Dev

How would you like to

manage capacity risk?

What are your goals to

optimize your environment

22

How Do You Model Your Capacity Needs?

Identify the Right Controls

Allocation and Demand Model

Over-commit ratios

Thresholds for capacity risk

Buffers

Business Hours
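How two of these controls, the over-commit ratio and the buffer, combine in an allocation model can be sketched with simple arithmetic. The function names and the example ratios and buffers below are assumptions for illustration, not product defaults.

```python
def usable_vcpu_capacity(physical_cores, overcommit_ratio, buffer_pct):
    """vCPUs that may be allocated after holding back a buffer."""
    return int(physical_cores * overcommit_ratio * (100 - buffer_pct)) // 100


def remaining_vcpu(physical_cores, overcommit_ratio, buffer_pct, allocated_vcpu):
    """vCPUs still available under the allocation model."""
    usable = usable_vcpu_capacity(physical_cores, overcommit_ratio, buffer_pct)
    return usable - allocated_vcpu


# Hypothetical clusters with 128 physical cores each.
production = usable_vcpu_capacity(128, overcommit_ratio=1, buffer_pct=20)
test_dev = usable_vcpu_capacity(128, overcommit_ratio=4, buffer_pct=10)
print(production, test_dev)
```

The same hardware yields very different capacity depending on the policy, which is why Production and Test-Dev are modeled separately.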

Page 23: Mastering Performance Monitoring and Capacity Planning using

Compute Storage

70% Utilized (Just right)

90% Utilized (Danger)

Network

35% Utilized (Over Provisioned)

• Capacity Monitoring and Analytics

– Capacity modeling for heterogeneous environments

– Out-of-the-box default policy configuration flow

– Enhanced forecasting functions and granular data

23

How VMware simplifies Capacity Management

• Project Planning

– Enhanced “What-If Scenarios”

– Plan projects, visualize changes and reserve capacity for future projects

– Extensible views, reports and alert definitions for capacity

Right-size environment

Run What-If Scenarios based on business needs

Page 24: Mastering Performance Monitoring and Capacity Planning using

Capacity Analytics

CONFIDENTIAL 24

Capacity Analytics to inform when, why, what and where

Granular breakdown of capacity metrics for Compute, Memory, Network and Storage

Page 25: Mastering Performance Monitoring and Capacity Planning using

Capacity Planning – New Project

CONFIDENTIAL 25

Add new VMs to deploy SharePoint app into Cluster

Use existing profile of VMs to calculate capacity needs

Based on this new project, Cluster will need more capacity

Page 26: Mastering Performance Monitoring and Capacity Planning using

Planning – Add Capacity

CONFIDENTIAL 26

Capacity plan is good!

Plan another project to see how many ESXi hosts are needed to meet capacity shortfall
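The host count for a shortfall can be estimated by hand as a sanity check on the what-if result. This is a back-of-the-envelope sketch, not the calculation the product performs; the shortfall figures and host sizes are hypothetical.

```python
import math


def hosts_needed(shortfall_ghz, shortfall_gb, host_ghz, host_gb):
    """Hosts required so that both CPU and memory shortfalls are covered."""
    by_cpu = math.ceil(max(shortfall_ghz, 0) / host_ghz)
    by_mem = math.ceil(max(shortfall_gb, 0) / host_gb)
    return max(by_cpu, by_mem)


# Hypothetical project: 30 GHz and 700 GB short, hosts of 40 GHz / 512 GB.
print(hosts_needed(30, 700, host_ghz=40, host_gb=512))
```

Memory is the binding resource in this example, so sizing by CPU alone would under-order hosts.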

Page 27: Mastering Performance Monitoring and Capacity Planning using

Optimization – Identify Overprovisioned Resources

CONFIDENTIAL – Shared under NDA ONLY 27

Breakdown of reclaimable capacity

Page 28: Mastering Performance Monitoring and Capacity Planning using

Automation – Take Action to Reclaim Capacity

28

One-click action to optimize your capacity

Page 29: Mastering Performance Monitoring and Capacity Planning using

Status Quo

• Are you able to meet or exceed service level agreements?

• Do you use point products to manage your IT infrastructure?

• Can you remediate issues before end users are impacted?

• What is your average Mean Time to Remediate (MTTR)?

• Do you leverage automated capacity optimization to improve

resource utilization?

• Are you able to accurately forecast your future capacity needs?

• Is your IT infrastructure compliant to regulatory standards?

• Can you proactively enforce IT standards in your organization?

Goal

What Operations Management Teams are Looking For?

Quality of

Service

Operational

Efficiency

Control

and

Compliance

29

Page 30: Mastering Performance Monitoring and Capacity Planning using

How VMware Helps in Enabling More Compliance and Control

Get continuous compliance and proactive management across apps and infrastructure

Control and Compliance

Key Capabilities

Proactive management via flexible groups and policies

Adhere to vendor guidelines, security best practices and regulatory standards

Benefits

45% reduction in time spent on ensuring compliance

Complete control with no need for manual processes

Page 31: Mastering Performance Monitoring and Capacity Planning using

IT Compliance Challenges

31

Silo-ed Monitoring and Compliance

Monitoring Compliance

Not integrated

No Performance Correlation to Changes

Performance Changes

Managing Users and Access Controls

Need to have tight controls in place

Missing insights

Multitude of Requirements

Security Best Practices

Vendor Hardening Guidelines

Regulatory Standards

Page 32: Mastering Performance Monitoring and Capacity Planning using

VMware Covers the Spectrum of IT Compliance

32

• Achieve compliance to regulatory standards such as PCI, HIPAA etc.

• Ensure compliance to internal IT policies and security best practices.

• Adopt latest guidelines from vendors such as Microsoft, Cisco etc.

• Deploy and operate VMware Products in a secure manner.

vSphere Security Hardening

Vendor Best Practices

Regulatory Compliance

Custom IT Policies

Page 33: Mastering Performance Monitoring and Capacity Planning using

Flexible Groups and Policies

33

• Proactive Management

– Prioritize critical workloads by defining thresholds, alerts and configuration settings for specific resource groups

– Define custom policies for specific workload types, applications or clusters.

– Apply to both vSphere and non vSphere object types

– Example: Production resources vs. development resources
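The production versus development example above can be pictured as a lookup from group to policy. In the product, policies are configured through the UI; the sketch below only illustrates the concept, and every field name and threshold in it is a made-up assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    cpu_contention_warn_pct: float
    memory_overcommit_allowed: bool


# Hypothetical thresholds: production is held to a tighter standard.
POLICIES = {
    "production": Policy(cpu_contention_warn_pct=2.0, memory_overcommit_allowed=False),
    "development": Policy(cpu_contention_warn_pct=10.0, memory_overcommit_allowed=True),
}
DEFAULT_POLICY = Policy(cpu_contention_warn_pct=5.0, memory_overcommit_allowed=True)


def policy_for(group):
    """Return the policy for a group, falling back to the default."""
    return POLICIES.get(group, DEFAULT_POLICY)


def breaches(group, cpu_contention_pct):
    """True when the observed contention exceeds the group's threshold."""
    return cpu_contention_pct > policy_for(group).cpu_contention_warn_pct


print(breaches("production", 4.0), breaches("development", 4.0))
```

The same 4% contention is a breach for production and acceptable for development, which is the point of grouping resources under different policies.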

Page 34: Mastering Performance Monitoring and Capacity Planning using

Continuous Compliance Monitoring & Enforcement

Monitor compliance to standards

PCI DSS Standard

Take action on non-compliant items by launching Configuration Manager

Page 35: Mastering Performance Monitoring and Capacity Planning using

Operations Management in the Cloud Era

“Intelligent Operations from Apps to Storage”

1 Purpose built for mobile/cloud era

• Self-learning predictive analytics and smart alerts

• Capacity optimization across virtual and physical stack

2 Policy based automation

• Automated root cause analysis with compliance visibility

• Granular access control and orchestrated workflows

3 Fast time to value

• Fast and easy deployment as a virtual appliance

• Best for vSphere and supports multi hypervisors

4 From the trusted market leader

• Virtualization and cloud systems management leader

• The only integrated, open and comprehensive solution

START TODAY!

Page 36: Mastering Performance Monitoring and Capacity Planning using

Agenda

36

1 Introduction to vRealize Operations Suite

2 Operations Management Goals

3 Real World Troubleshooting Scenarios

4 Q&A

Page 37: Mastering Performance Monitoring and Capacity Planning using

How do Customers find problems in their infrastructure ?

37

Start By

Search for problem

Phone call / support ticket

Big Visual

Blind Luck!

vR Ops God!

Alerts/Notifications

Page 38: Mastering Performance Monitoring and Capacity Planning using

One day in the life of VMware Admin…

• A VM Owner complains to IaaS Team that her VM is slow.

• Her application architect has verified that:

– The VM CPU and RAM utilization is good.

– The disk latency is good.

– There are no dropped network packets.

– No change in the application settings

– No recent patch to Windows

What do you do?

• A: Check ESXi utilization. If it’s low, tell her to doubt no more.

• B: Buy her a nice lunch + flower. Ask her to forget about it

• C: Call your VMware TAM & MCS. That’s why you pay them right?

• D: Roll up your sleeve. You are born for this!

Page 39: Mastering Performance Monitoring and Capacity Planning using

What’s wrong with these statements?

• Cluster CPU

– CPU Ratio is high at 1:5 times on cluster “XYZ”

– Rest all other cluster overcommit ratio looks good around 1:3

– Keep the over commitment ratio to 1:4.

– CPU usage is around 50% on cluster “ABCDE”. Since they are UAT servers, don’t worry.

– Rest other cluster CPU utilization is around 25%. This is good!

• Cluster RAM

– We recommend 1:2 overcommit ratio between physical RAM and virtual RAM.

– Memory Usage on most of the cluster is high around 60%

– Cluster “ABCD” is running peak at around 75%. CPU utilization should be less than 70%

– If we see that Active Mem% is also high than we should add more RAM to cluster

– % Active should not exceed 50-60% and Memory should be running at high state on each host

39

Page 40: Mastering Performance Monitoring and Capacity Planning using

Monitoring

• There are 2 levels to monitor in VMware:

– The VM.

• VM is the most important as that’s all customers care about.

• They do not care about your infrastructure. It is a Service. IaaS.

– The Infra.

• Software: NSX, vCenter, VSAN, vRealize, Distributed vSwitch, Datastore

• ESXi + hardware

• Storage & Fabric

• Network

• There are 4 areas to monitor

• The 4 areas above impact one another

Page 41: Mastering Performance Monitoring and Capacity Planning using

2 distinct layers

Consumer Layer (VM)

1. Performance: We check if the VM is being served well by the platform. Other VMs are irrelevant from the VM Owner's point of view.

2. Capacity: We check if the VM is right-sized. If too small, increase its configuration. If too big, right-size it for better performance.

Provider Layer (SDDC)

1. Performance: We check if IaaS is serving everyone well. Make sure there is no contention for resources among all the VMs.

2. Capacity: Check utilization. Too low, we spent too much on hardware. Too high, we need to buy more hardware.

3. Configuration: Check for Compliance and Config Drift. Availability: Get alerts for hardware faults or software that stops working.

Page 42: Mastering Performance Monitoring and Capacity Planning using

Performance

How do you know your IaaS is performing fast?

ESXi utilization at 10% means your ESXi is fast?

ESXi utilization at 90% means your ESXi is fast?

Storage is doing 10K IOPS?

Network is processing 8 Gbps?

What counter do you use as proof to your customers (VM Owners)?

Utilization?

Performance is measured by how well your IaaS serves the VMs.

Fast is relative to your customer. Use SLA as your defense line.

Page 43: Mastering Performance Monitoring and Capacity Planning using

Capacity

Page 44: Mastering Performance Monitoring and Capacity Planning using

Performance and Capacity Management

Performance: Focus is on the VM. In most cases, does not apply to IaaS.
Capacity: Focus is on the IaaS. VM Capacity Management is just right sizing.

Performance: Primary counter is Contention or Latency. Utilization is largely irrelevant.
Capacity: Primary counter is Contention or Latency. Secondary counter is Utilization.

Performance: Does not take into account Availability SLA.
Capacity: Takes into account Availability SLA. Tier 1 is in fact Availability-driven.

Page 45: Mastering Performance Monitoring and Capacity Planning using

The Consumer Layer The “dining area”

CONFIDENTIAL 45

Page 46: Mastering Performance Monitoring and Capacity Planning using

How a VM gets its resource

Provisioned

Limit

Reservation

Entitlement

0 vCPU or 0 GB

Contention

Usage

Demand This is the counter

we need to measure

4 vCPU or 16 GB

Page 47: Mastering Performance Monitoring and Capacity Planning using

Dashboards

• Detail monitoring of a single VM

– When a customer complains that his VM is slow, can the help desk add value right away?

• Large VMs Monitoring

– Because they are actually hurting your IaaS business

– This impacts both Performance and Capacity

• VM Right Sizing

• Excessive Usage

– Excessive Usage by 1-2 VM can impact the overall IaaS performance.

– VMs with excessive usage hurt the business if we do not charge for Network and Disk IOPS.

Page 48: Mastering Performance Monitoring and Capacity Planning using

Single VM Monitoring

• A VM Owner complains that his VM is slow.

– It was okay the day before

– How does Help Desk quickly determine where the issue is?

• How well does Infra serve the VM?

– VM CPU Contention

– VM RAM contention

– VM Disk latency. For each virtual disk, not average.

• Is VM undersized?

– VM CPU Utilisation

– VM RAM Consumed (not Usage)

– VM RAM Usage

– VM Disk IOPS
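The two questions above (is the infrastructure serving the VM, and is the VM undersized) can be turned into a small triage routine for the help desk. This is a sketch under assumed metric names and thresholds; none of the dictionary keys are real vRealize Operations metric keys.

```python
def triage(metrics, sla):
    """Separate 'infra is not serving the VM' from 'VM may be undersized'."""
    findings = []
    if metrics["cpu_contention_pct"] > sla["cpu_contention_pct"]:
        findings.append("infra: CPU contention above SLA")
    if metrics["ram_contention_pct"] > sla["ram_contention_pct"]:
        findings.append("infra: RAM contention above SLA")
    # Check the worst virtual disk, not the average across disks.
    if max(metrics["disk_latency_ms_per_vdisk"]) > sla["disk_latency_ms"]:
        findings.append("infra: disk latency above SLA")
    if metrics["cpu_utilisation_pct"] > sla["undersized_cpu_pct"]:
        findings.append("vm: CPU utilisation suggests undersizing")
    return findings


# Hypothetical SLA thresholds and one VM's readings.
SLA = {"cpu_contention_pct": 3, "ram_contention_pct": 1,
       "disk_latency_ms": 20, "undersized_cpu_pct": 90}
vm = {"cpu_contention_pct": 1.0, "ram_contention_pct": 0.0,
      "disk_latency_ms_per_vdisk": [4, 35, 6], "cpu_utilisation_pct": 55}

print(triage(vm, SLA))
```

The average latency of the three disks is 15 ms, which would pass the 20 ms threshold; only the per-disk check exposes the slow one.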

Page 49: Mastering Performance Monitoring and Capacity Planning using
Page 50: Mastering Performance Monitoring and Capacity Planning using
Page 51: Mastering Performance Monitoring and Capacity Planning using
Page 52: Mastering Performance Monitoring and Capacity Planning using
Page 53: Mastering Performance Monitoring and Capacity Planning using

Dashboard 1

Single VM

Monitoring

Page 54: Mastering Performance Monitoring and Capacity Planning using

Are the Large VMs oversized?

• They cause performance issue

– They impact others, and also themselves!

– ESXi vmkernel scheduler has to find available cores for all the vCPU, even though they are idle.

– Other VMs may be migrated from core to core. The counter at esxtop tracks this migration.

• Tends to have slower performance

– ESXi may not have all the available vCPU for them.

• Reduces consolidation ratio

– You can pack more vCPU with smaller VM than with big VM.

– Unless you have progressive pricing, you make more money with smaller VM as you sell more vCPU.

Page 55: Mastering Performance Monitoring and Capacity Planning using

Dashboard of Large VMs

• Overall Picture

– A line chart showing Max CPU Demand among all the Large VMs

• If this is low, they are way oversubscribed. Remember, it only takes 1 VM to make this number high.

• This number should be 80% most of the time, indicating right sizing.

– A line chart showing Average CPU Demand

• If this chart is below 25% all the time for the entire month, then the large VMs are oversized.

• Heat Map of Large VMs

– Size by vCPU config, so it’s easy to see which is the biggest among these large VMs.

– Color by CPU Workload. Both high and low are bad. You want to see ~50% CPU utilisation

• To differentiate between the 2 ends, choose Black and Red. Expect to see mostly green.

• Top-N CPU Demand

– Allows us to zoom into specific time to see the past

• Line chart of a selected VM (automatically plotted)
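The two line-chart rules above can be checked programmatically against a month of samples. The sketch below assumes the samples are already collected as percentages; how "most of the time" is quantified is left to the reader, so the second helper only reports a fraction.

```python
def large_vms_oversized(avg_cpu_demand_pct):
    """True when average CPU demand stayed below 25% for every sample."""
    return all(sample < 25 for sample in avg_cpu_demand_pct)


def share_of_time_at_or_above(samples, level_pct):
    """Fraction of samples at or above the given level."""
    return sum(1 for s in samples if s >= level_pct) / len(samples)


# Hypothetical samples of Average and Max CPU Demand across the large VMs.
month_avg = [12, 18, 15, 20, 17]
month_max = [35, 40, 82, 38, 41]

print(large_vms_oversized(month_avg))
print(share_of_time_at_or_above(month_max, 80))
```

Here the average never reaches 25% and the max touches 80% only once, so this group of large VMs would be a right-sizing candidate.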

Page 56: Mastering Performance Monitoring and Capacity Planning using

As expected, the Max of All VMs is low. We can go back in time and see over 3 months.

As expected, they are mostly Black. This means they are over provisioned.

This shows the Top 15 VM. You can change the period to any time.

This is auto shown. We are showing CPU and RAM. You expect 70% range, not 20% like this example.

Page 57: Mastering Performance Monitoring and Capacity Planning using

CONFIDENTIAL 57

Page 58: Mastering Performance Monitoring and Capacity Planning using

CONFIDENTIAL 58

Page 59: Mastering Performance Monitoring and Capacity Planning using

Dashboard 2

Large VM

Monitoring

Page 60: Mastering Performance Monitoring and Capacity Planning using

Any Excessive Utilization in our DC?

• A VM consumes 5 resources:

1. vCPU

2. vRAM (GB)

3. Disk Space

4. Disk IOPS

5. Network (Mbps)

• The first 3 you can bound and control

• The last 2 you can, but normally you don’t do it. You should.

– Application Team does not normally know how much IOPS or Network they need.

– Do you allow any VM to generate 100K IOPS?

– Do you allow any VM to saturate 1Gb link?

• Need a dashboard to track excessive usage

– Disk IOPS

– Network throughput

Page 61: Mastering Performance Monitoring and Capacity Planning using

Dashboard for Excessive Utilisation

• Excessive Storage consumption

– Line Chart:

• Max VM Disk IOPS among all VMs

• Average VM Disk IOPS

– Heat Map

• Size by IOPS. Color by Latency

• If you see a big box, that means you have a VM dominating your storage IOPS.

• Excessive Network consumption

– Similar concept as above
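The line-chart values (Max and Average IOPS across all VMs) and a Top-N list can be derived from a single snapshot of per-VM IOPS. The sketch below uses made-up VM names and numbers; it only illustrates why comparing Max against Average reveals a single dominant VM.

```python
import heapq
from statistics import mean


def summarize_iops(iops_by_vm, top_n=3):
    """Return max, average and the top-N VMs by IOPS."""
    values = list(iops_by_vm.values())
    top = heapq.nlargest(top_n, iops_by_vm.items(), key=lambda kv: kv[1])
    return {"max": max(values), "average": mean(values), "top": top}


# Hypothetical snapshot of per-VM disk IOPS.
snapshot = {"vm-a": 120, "vm-b": 9000, "vm-c": 80, "vm-d": 100}
summary = summarize_iops(snapshot, top_n=1)
print(summary)
```

A large gap between the max and the average points to one or two VMs rather than a cluster-wide load, and the top-N list names them.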

Page 62: Mastering Performance Monitoring and Capacity Planning using

This tracks the IOPS from VMs. From here we can tell there is a distinct peak. It looks like it’s coming from 1 VM, as the average is far lower. This is a cluster of 500 VMs, so even if 1 VM hits 13,200 IOPS, the average did not even pass 15 IOPS.

Let’s zoom into the peak.

Page 63: Mastering Performance Monitoring and Capacity Planning using

Excessive Storage Dashboard

The peak was 13,212 IOPS on 24 May, around 3:16 am. Let’s find out

which VM.

Page 64: Mastering Performance Monitoring and Capacity Planning using

Excessive Storage Dashboard

• We can list the Top VMs generating the IOPS on any given period.

Bingo, it was VM 63ee that did that 13,212 IOPS.

Gotcha!

The dashboards are great, but they do not tell you how the IOPS are distributed among all the VMs. They also do not tell you if the VMs are experiencing high latency.

You need a Heat Map for this.

Page 65: Mastering Performance Monitoring and Capacity Planning using

At a glance, we can tell the IOPS distribution among the VMs. We can also tell whether they are getting low latency or not.

Page 66: Mastering Performance Monitoring and Capacity Planning using

Dashboard 3

Excessive DC

Utilization

Page 67: Mastering Performance Monitoring and Capacity Planning using

And that’s it! You “passed” those dashboards, you’re done with the “dining area”!

67

Page 68: Mastering Performance Monitoring and Capacity Planning using

The Provider Layer The “kitchen”

CONFIDENTIAL 68

Page 69: Mastering Performance Monitoring and Capacity Planning using

Performance Management

• Overall Performance Monitoring

– Is any of our customers experiencing bad performance?

– CPU, RAM, Disk, Network

• If yes, who are affected?

– Different VM may get different impact.

– VM 007 may get hit on CPU, while VM 747 may get hit on Storage.

Page 70: Mastering Performance Monitoring and Capacity Planning using

Performance SLA Monitoring

• How do we prove that….not a single VM… in any service tier…. fails the SLA threshold we agree for that tier… in the past 1 month?

• Since VMs move around in a cluster due to DRS and HA, we need to track at Cluster level.

• If you oversubscribe, there is a risk of Contention.

– For Tier 1, do not overcommit.

– For Tier 2 and 3, do overcommit.

Page 71: Mastering Performance Monitoring and Capacity Planning using

Using Max and Average to determine how VMs are served

If the Max is: • below what you think your customers can tolerate, then you are good.

• Near the threshold, then your capacity is full. Do not add more VM.

• Above the threshold, move a few VMs out, preferably the large ones.
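The three cases above map directly onto a small classifier. The slide does not define how close "near" is, so the 10% margin below is an assumption, as are the function name and the example values.

```python
def capacity_verdict(max_value, threshold, near_margin=0.1):
    """Classify the worst observed value against the SLA threshold."""
    if max_value > threshold:
        return "above"  # move a few VMs out, preferably the large ones
    if max_value >= threshold * (1 - near_margin):
        return "near"   # capacity is full, do not add more VMs
    return "below"      # customers are being served within tolerance


# Hypothetical SLA: max CPU contention of 5%.
for observed in (2.0, 4.8, 6.0):
    print(observed, capacity_verdict(observed, threshold=5.0))
```

Because the input is the Max across all VMs in the cluster, a "below" verdict means no single VM breached the threshold.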

Page 72: Mastering Performance Monitoring and Capacity Planning using
Page 73: Mastering Performance Monitoring and Capacity Planning using
Page 74: Mastering Performance Monitoring and Capacity Planning using
Page 75: Mastering Performance Monitoring and Capacity Planning using
Page 76: Mastering Performance Monitoring and Capacity Planning using

This dashboard is good as summary. You stop here if there is no issue.

Yes, 1 dashboard!

Page 77: Mastering Performance Monitoring and Capacity Planning using

Which VMs are affected?

• The previous slides give us info at Cluster level.

– If there is no VM affected, it’s good. No need to analyse further.

– If there are VMs affected, we want to know which ones.

• We can address the above by listing the Top 30 VM

– CPU Contention

– RAM Contention

– Disk Latency

– Network dropped packets (ensure it is 0)

– Network latency (this needs NetFlow)

Page 78: Mastering Performance Monitoring and Capacity Planning using

These are the top 40 VMs which

experienced the worst CPU

Contention.

These are the top 40 VMs which

experienced the worst RAM

Contention.

These are the top 40 VMs which

experienced the worst Disk

Latency.

Page 79: Mastering Performance Monitoring and Capacity Planning using

And that’s it! If Performance is ok, it’s time to review Capacity

79

Page 81: Mastering Performance Monitoring and Capacity Planning using

Performance Policy

81

Group Discussion: What should your Performance Policy be?

Page 82: Mastering Performance Monitoring and Capacity Planning using

Capacity Management: Tier 1

5 line charts showing these in the past 3 months

• Number of vCPU left in the cluster.

• Number of vRAM left in the cluster.

• Number of VM left in the cluster.

• Maximum & Average storage latency experienced by any VM in the cluster

• “Usable” space left in the datastore cluster.
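For a Tier 1 cluster with no overcommit, the "vCPU left" figure reduces to simple arithmetic. The sketch below assumes one vCPU per physical core and sets aside spare hosts for availability; the host counts and the function name are hypothetical.

```python
def tier1_vcpu_left(hosts, cores_per_host, ha_spare_hosts, allocated_vcpu):
    """vCPUs still available when each vCPU is backed by one physical core."""
    usable_cores = (hosts - ha_spare_hosts) * cores_per_host
    return usable_cores - allocated_vcpu


# Hypothetical cluster: 8 hosts of 32 cores, 1 host held back for HA.
print(tier1_vcpu_left(hosts=8, cores_per_host=32, ha_spare_hosts=1, allocated_vcpu=180))
```

Plotting this number over three months gives the first of the line charts described above.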

82

If a number is approaching your low threshold, it’s time to increase supply (e.g. IOPS, Cluster)

Page 83: Mastering Performance Monitoring and Capacity Planning using

Capacity Management: Tier 2 or 3

5 line charts showing data in the past 3 months

• The Maximum CPU Contention experienced by any VM in the cluster.

– This number has to be lower than the SLA we promise.

• The Maximum RAM Contention experienced by any VM in the cluster.

– This number has to be lower than the SLA we promise.

• The total number of VM left in the cluster.

• The Maximum & Average storage latency experienced by any VM in the cluster

• The disk capacity left in the datastore cluster.

83

Page 84: Mastering Performance Monitoring and Capacity Planning using

Key Takeaways

Agree on a Performance SLA.

Contention, not Utilization.

Capacity is defined by Performance.

CONFIDENTIAL 84

Page 85: Mastering Performance Monitoring and Capacity Planning using

Thank you

Page 86: Mastering Performance Monitoring and Capacity Planning using

Appendix

86

Page 87: Mastering Performance Monitoring and Capacity Planning using

Understanding VM CPU Demand vs Usage

vSphere Reported

CPU Usage: What VM Got Right now

Contention: What VM Could not Get

vROps Reported

CPU Demand: What VM wants

If CPU Demand (What VM wants) > CPU Usage (What VM Got Right now), the VM has Performance Impact and needs Troubleshooting.
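Reading this slide as "demand greater than usage means the VM is not getting what it wants", the check can be sketched as below. The MHz units, the tolerance parameter and the function names are assumptions for illustration.

```python
def unmet_cpu_demand(cpu_demand_mhz, cpu_usage_mhz):
    """CPU the VM wanted but did not get (never negative)."""
    return max(cpu_demand_mhz - cpu_usage_mhz, 0)


def needs_troubleshooting(cpu_demand_mhz, cpu_usage_mhz, tolerance_mhz=0):
    """True when demand exceeds usage by more than the tolerance."""
    return unmet_cpu_demand(cpu_demand_mhz, cpu_usage_mhz) > tolerance_mhz


# Hypothetical readings for two VMs.
print(needs_troubleshooting(2400, 1800))
print(needs_troubleshooting(1000, 1000))
```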

Page 88: Mastering Performance Monitoring and Capacity Planning using

Troubleshooting Population Pressure

Entitlement: What VM can ever Get

CPU Usage: What VM Got Right now

Contention: What VM Could not get

If VM has Population Pressure, it needs: Move VM, or Add Physical Capacity

Page 89: Mastering Performance Monitoring and Capacity Planning using

vR Ops 6.0 Out of Box

Troubleshooting Memory

CONFIDENTIAL 89

Allocation (No Overcommit)

• Most Conservative • Configured Memory • Wasteful in non Production Env

Usage (Active)

• Most Aggressive • Current Active Demand

Consumed (All Touched Bits)

• vSphere reported • Moderate Approach • Java, SQL, Xchange

Oracle VM • Memory Configured : 1GB • Memory Consumed : 721MB • Memory Demand : 292MB
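The effect of choosing one memory model over another can be worked through with the Oracle VM figures above. The sketch treats 1GB as 1024MB and asks how many identical VMs a hypothetical 64 GB host could hold under each model; the host size and the idea of packing identical VMs are assumptions.

```python
# Figures from the Oracle VM example, in MB (1GB taken as 1024MB).
ORACLE_VM_MB = {"allocation": 1024, "consumed": 721, "demand": 292}


def vms_per_host(host_memory_mb, per_vm_mb):
    """How many identical VMs fit if each is counted at per_vm_mb."""
    return host_memory_mb // per_vm_mb


HOST_MB = 64 * 1024  # hypothetical host with 64 GB of RAM

fit = {model: vms_per_host(HOST_MB, mb) for model, mb in ORACLE_VM_MB.items()}
print(fit)
```

The spread between the most conservative and most aggressive model is large, which is why the choice of model belongs in the capacity policy rather than being left implicit.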

Page 90: Mastering Performance Monitoring and Capacity Planning using

Our Philosophy Is Not your Philosophy : Mem Consumed in 6.1

91

Total Memory Touched by VM

vSphere vROps 6.1