Enforcing Application SLA with Congress and Monasca

Enforcing Application SLAs with Congress and Monasca

Fabio Giannetti, Ken Owens

April 28, 2016

© 2016 Cisco and/or its affiliates. All rights reserved. Cisco Confidential

Outline

• Vision
• Congress and Monasca implementing:
  • OPS/NOC SLA Policies
  • App Intent SLA Policies
• Current State and Next Steps

Vision

Vision

• Application owners/developers do not care about the underlying infrastructure unless it is a problem.
• Microservices-based architectures inherently demand granular application design.
• SLAs for applications must be holistic and independent of the underlying infrastructure.

[Figure: a Host with Virtualization and Container layers; the services (Srvc) running on them are grouped into Application A and Application B.]

Vision

Enable business/application owners to easily define the aspects that are relevant to running their applications, within the budget constraints imposed by IT.

Vision

Monitoring is now holistic: it has to consider the various levels of virtualization and harmonize data across the different layers.

Containers are short-lived and are moved around the available infrastructure.

[Figure: a Host with Virtualization and Container layers.]

Vision

When an application owner's soft limit is crossed, an alarm is sent back to the owner; when a hard limit is crossed, the corresponding action is performed.

OPS/NOC SLA using Congress and Monasca

Underutilized Servers OPS/NOC Policy Example

error(vm, email) :-
    nova:server_owner(vm, owner),
    two_months_before_today(start, end),
    ceilometer:statistics(vm, start, end, "cpu-util", cpu),
    cpu < 5,
    keystone:email(owner, email)

two_months_before_today(start, end) :-
    date:today(end),
    date:minus(end, "2 months", start)

If a VM has less than 5% CPU utilization for the last 2 months, then notify its owner via email

Current Solution

[Diagram: inside Congress (Congress API, Policy Engine), the Ceilometer Datasource polls the Ceilometer API every <n>s with GET /v2/meters/cpu_util/statistics?resource_id=…]

Ceilometer Datasource:
VM UUID (Resource ID)            CPU
xxxxxxxx-0001-xxxx-xxxxxxxxxxx   40
xxxxxxxx-0002-xxxx-xxxxxxxxxxx   30
xxxxxxxx-0003-xxxx-xxxxxxxxxxx   2
xxxxxxxx-0004-xxxx-xxxxxxxxxxx   70
xxxxxxxx-0005-xxxx-xxxxxxxxxxx   55

Current Solution

[Diagram: Congress (Congress API, Policy Engine) with the Ceilometer Datasource, the Nova Datasource (Nova API) and the Keystone Datasource (Keystone API).]

Ceilometer Datasource:
VM UUID (Resource ID)   CPU
xxxxxxxx-0001-xxxx      40
xxxxxxxx-0002-xxxx      30
xxxxxxxx-0003-xxxx      2
xxxxxxxx-0004-xxxx      70
xxxxxxxx-0005-xxxx      55

Nova Datasource:
VM                   Owner
xxxxxxxx-0001-xxxx   Ann
xxxxxxxx-0002-xxxx   Fabio
xxxxxxxx-0003-xxxx   Fabio
xxxxxxxx-0004-xxxx   Ken
xxxxxxxx-0005-xxxx   Ken

Keystone Datasource:
Owner   Email
Ann     AnnNotRealEmail@cisco.com
Fabio   FabioNotRealEmail@cisco.com
Ken     KenNotRealEmail@cisco.com

Result:
VM                   Email
xxxxxxxx-0003-xxxx   FabioNotRealEmail@cisco.com
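
The join that the policy performs over these three tables can be sketched in plain Python. This is only an illustration of the logic, not how the Congress policy engine evaluates Datalog. It assumes the CPU column on the slide reads 40, 30, 2, 70 and 55 for VMs 0001 to 0005, and the function name underutilized_vm_emails is made up for this example.

```python
# Tables as shown on the slide. The CPU values are an assumed reading
# of the slide's CPU column (40, 30, 2, 70, 55).
cpu_by_vm = {
    "xxxxxxxx-0001-xxxx": 40,
    "xxxxxxxx-0002-xxxx": 30,
    "xxxxxxxx-0003-xxxx": 2,
    "xxxxxxxx-0004-xxxx": 70,
    "xxxxxxxx-0005-xxxx": 55,
}
owner_by_vm = {
    "xxxxxxxx-0001-xxxx": "Ann",
    "xxxxxxxx-0002-xxxx": "Fabio",
    "xxxxxxxx-0003-xxxx": "Fabio",
    "xxxxxxxx-0004-xxxx": "Ken",
    "xxxxxxxx-0005-xxxx": "Ken",
}
email_by_owner = {
    "Ann": "AnnNotRealEmail@cisco.com",
    "Fabio": "FabioNotRealEmail@cisco.com",
    "Ken": "KenNotRealEmail@cisco.com",
}

CPU_THRESHOLD = 5  # percent, from "cpu < 5" in the policy


def underutilized_vm_emails(cpu, owners, emails, threshold=CPU_THRESHOLD):
    """Return (vm, email) pairs, mirroring error(vm, email) in the policy.

    Hypothetical helper: joins CPU -> owner -> email for VMs below threshold.
    """
    rows = []
    for vm, value in cpu.items():
        if value >= threshold:
            continue  # VM is busy enough, the policy is not violated
        email = emails.get(owners.get(vm))
        if email is not None:
            rows.append((vm, email))
    return rows


errors = underutilized_vm_emails(cpu_by_vm, owner_by_vm, email_by_owner)
print(errors)
```

Only VM 0003 falls below the 5% threshold, so it is the only row that reaches the result table.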

From Policy to Alarm

error(vm, email) :-
    nova:server_owner(vm, owner),
    two_months_before_today(start, end),
    monasca_alarms:stats(vm, start, end, "cpu.user_perc", cpu),
    cpu < 5,
    keystone:email(owner, email)

two_months_before_today(start, end) :-
    date:today(end),
    date:minus(end, "2 months", start)

{
  "name": "Average CPU percent is less than 5",
  "description": "The average CPU percent is less than 5",
  "expression": "(avg(cpu.user_perc{resource_id=vm}) < 5)",
  "match_by": [ "resource_id" ],
  "severity": "HIGH",
  "ok_actions": [ "action_id_for_ok" ],
  "alarm_actions": [ "action_id_for_alarm" ]
}
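
The step from the policy's metric condition to the alarm definition can be sketched as a small builder. This is a minimal sketch under stated assumptions: build_alarm_definition and its parameters are hypothetical names for this example, not part of Congress or Monasca, and the output simply reproduces the fields shown on the slide.

```python
import json


def build_alarm_definition(name, description, metric, operator, threshold,
                           match_by, ok_action_id, alarm_action_id,
                           severity="HIGH"):
    """Hypothetical helper: turn a metric condition from a policy into the
    alarm definition body shown on the slide."""
    # Doubled braces emit literal { and } around the dimension filter.
    expression = f"(avg({metric}{{{match_by}=vm}}) {operator} {threshold})"
    return {
        "name": name,
        "description": description,
        "expression": expression,
        "match_by": [match_by],
        "severity": severity,
        "ok_actions": [ok_action_id],
        "alarm_actions": [alarm_action_id],
    }


definition = build_alarm_definition(
    name="Average CPU percent is less than 5",
    description="The average CPU percent is less than 5",
    metric="cpu.user_perc",
    operator="<",
    threshold=5,
    match_by="resource_id",
    ok_action_id="action_id_for_ok",        # placeholder from the slide
    alarm_action_id="action_id_for_alarm",  # placeholder from the slide
)
payload = json.dumps(definition, indent=2)
print(payload)
```

Keeping the metric name, operator and threshold as separate arguments mirrors the pieces of the policy body ("cpu.user_perc", cpu < 5) that the alarm expression is derived from.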

Proposed Solution (receiving notifications)

[Diagram: Monasca components (Monasca Agents, Monasca API, Kafka Cluster, Threshold Engine, Persister, Notification Engine, Metrics DB, Settings DB) and Congress components (Congress API, Policy Engine, Monasca Alarm Datasource).]

Alarm definition: POST /v2.0/alarm-definitions
Notification method: monasca notification-create congress WEBHOOK http:…/v1/data-sources/monasca_alarm?execute&action=handle_alarm
Webhook: …/v1/data-sources/monasca_alarm?execute&action=handle_alarm
Datasource action: handle_alarm(params)

Monasca Alarm Datasource:
VM UUID (Resource ID)   CPU
xxxxxxxx-0003-xxxx      2
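
A handle_alarm action of this kind could keep the datasource table in step with incoming notifications, roughly as sketched below. Everything here is an assumption for illustration: the class name, the shape of params (keys "resource_id", "state" and "cpu") and the rule of dropping a row when the alarm clears are invented, and this is not Monasca's actual webhook payload or Congress's datasource driver interface.

```python
class MonascaAlarmTable:
    """Hypothetical in-memory table behind a Monasca Alarm Datasource."""

    def __init__(self):
        # Maps VM UUID (resource ID) to the last reported CPU value.
        self.rows = {}

    def handle_alarm(self, params):
        """Apply one notification to the table.

        params is an assumed dict such as
        {"resource_id": "...", "state": "ALARM", "cpu": 2}.
        """
        vm = params["resource_id"]
        if params["state"] == "ALARM":
            # Alarm raised: the VM now violates the threshold.
            self.rows[vm] = params["cpu"]
        else:
            # Any other state: the VM no longer belongs in the table.
            self.rows.pop(vm, None)


table = MonascaAlarmTable()
table.handle_alarm({"resource_id": "xxxxxxxx-0003-xxxx", "state": "ALARM", "cpu": 2})
print(table.rows)
```

With this push model the table only ever holds VMs that are currently in alarm, which is why it contains a single row where the polled Ceilometer table contained all five VMs.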

Proposed Solution (receiving notifications)

[Diagram: Congress (Congress API, Policy Engine) with the Monasca Alarm Datasource, the Nova Datasource (Nova API) and the Keystone Datasource (Keystone API).]

Monasca Alarm Datasource:
VM UUID (Resource ID)   CPU
xxxxxxxx-0003-xxxx      2

Nova Datasource:
VM                   Owner
xxxxxxxx-0003-xxxx   Fabio

Keystone Datasource:
Owner   Email
Fabio   FabioNotRealEmail@cisco.com

Result:
VM                   Email
xxxxxxxx-0003-xxxx   FabioNotRealEmail@cisco.com

Application Intent SLA using Congress and Monasca

App Intent Policy Example: VM Evacuation for a Business-Critical App if the Host Has Potential Health Issues

error(vm) :- nova:show(vm, hostID), monasca_alarm:host_issues(hostID)

If a Host has issues, for instance:

1. Unhealthy: cannot be pinged and/or reached over SSH

2. Network errors and packet loss

3. Free disk space below a certain threshold

App Intent Policy: Metrics Correlation

error(vm) :- nova:show(vm, hostID), monasca_alarm:host_issues(hostID)

host_alive_status
  Dimensions: observer_host=fqdn, hostname=supplied hostname being checked, test_type=ping or ssh
  Value: 0=online, 1=offline

disk.space_used_perc
  Dimensions: device, mount_point
  Value: The percentage of disk space that is being used on a device

net.in_packets_dropped_sec
  Dimensions: device
  Value: Number of inbound network packets dropped per second

net.out_packets_dropped_sec
  Dimensions: device
  Value: Number of outbound network packets dropped per second

App Intent Policy: Multi-Alarms #1

{
  "name": "Host is Unhealthy",
  "description": "The host is considered unhealthy",
  "expression": "(host_alive_status{host_id=hostID} = 1)",
  "match_by": [ "host_id" ],
  ...
}

{
  "name": "Host disk getting full",
  "description": "The host disk is reaching capacity",
  "expression": "(disk.space_used_perc{host_id=hostID} > 90)",
  "match_by": [ "host_id" ],
  ...
}

Metric Name                   Value
host_alive_status             0=online, 1=offline
disk.space_used_perc          The percentage of disk space that is being used on a device
net.in_packets_dropped_sec    Number of inbound network packets dropped per second
net.out_packets_dropped_sec   Number of outbound network packets dropped per second

App Intent Policy: Multi-Alarms #2

{
  "name": "Host is Unhealthy",
  "description": "The host is considered unhealthy",
  "expression": "(net.in_packets_dropped_sec{host_id=hostID} > 30)",
  "match_by": [ "host_id" ],
  ...
}

{
  "name": "Host disk getting full",
  "description": "The host disk is reaching capacity",
  "expression": "(net.out_packets_dropped_sec{host_id=hostID} > 30)",
  "match_by": [ "host_id" ],
  ...
}

Metric Name                   Value
host_alive_status             0=online, 1=offline
disk.space_used_perc          The percentage of disk space that is being used on a device
net.in_packets_dropped_sec    Number of inbound network packets dropped per second
net.out_packets_dropped_sec   Number of outbound network packets dropped per second
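
Taken together, the four alarm expressions define when a host counts as having issues, and the policy then flags every VM placed on such a host. The sketch below illustrates that correlation in plain Python. The thresholds come from the alarm expressions on the slides. The function names, host names, VM names and sample metric values are invented for this example, and in the proposed design the thresholds would be evaluated by Monasca rather than by code like this.

```python
import operator

# (metric name, comparison, limit) taken from the alarm expressions above.
HOST_ALARMS = [
    ("host_alive_status", operator.eq, 1),            # 1 = offline
    ("disk.space_used_perc", operator.gt, 90),
    ("net.in_packets_dropped_sec", operator.gt, 30),
    ("net.out_packets_dropped_sec", operator.gt, 30),
]


def host_issues(metrics):
    """Return the metric names whose alarm condition holds for one host."""
    return [
        name
        for name, compare, limit in HOST_ALARMS
        if name in metrics and compare(metrics[name], limit)
    ]


def error_vms(host_by_vm, metrics_by_host):
    """Mirror error(vm): VMs whose host has at least one issue."""
    return sorted(
        vm
        for vm, host in host_by_vm.items()
        if host_issues(metrics_by_host.get(host, {}))
    )


# Hypothetical sample data, not from the slides.
host_by_vm = {"vm-a": "host-1", "vm-b": "host-2"}
metrics_by_host = {
    "host-1": {"host_alive_status": 0, "disk.space_used_perc": 95},
    "host-2": {"host_alive_status": 0, "disk.space_used_perc": 40,
               "net.in_packets_dropped_sec": 3},
}

print(host_issues(metrics_by_host["host-1"]))
print(error_vms(host_by_vm, metrics_by_host))
```

Treating any single triggered alarm as a host issue matches the "for instance" list of independent conditions in the policy example; a stricter policy could require several conditions at once.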

Current State and Future Work

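Overall Architecture

[Diagram: Monasca components (Monasca Agents, Monasca API, Keystone, Kafka Cluster, Threshold Engine, Persister, Notification Engine, Metrics DB, Settings DB) connected to Congress (Congress API, Policy Engine, Monasca Alarm Datasource with an in-memory DB) through a webhook and RPC.]

Monasca Alarm Datasource (In Mem DB):
Metric    Value
metric1   val1
metricN   valN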

Current Status and Next Steps

• Done:
  • Developed a Monasca Datasource to validate integration.
  • Designed the solution and found the main integration points.
• To be Done:
  • Develop a Monasca Alarm Datasource leveraging the RPC capabilities in Congress.
  • Create a Congress Notification Webhook for Monasca.
  • Develop a policy-to-alarm conversion component for policies prefixed with monasca-alarm.

OpenStack Summit, Austin, Texas 2016

Thank You!
