Anatomy of an IT Outage (and ways to prevent them)

IT on Wall Street Seminar, Sep 22, 2016

Doron Pinhas, CTO

Or it can look like this:

“System outage grounds 2300 Delta flights”

» Always unexpected

» Usually takes too long to resume service

» Way too often…
• Root cause remains unknown
• Same problems reoccur

Anatomy of an IT outage

» Technology evolution and environment complexity don’t make it any easier

*Based on Continuity Software annual surveys

Are we getting better?

When was your last downtime event?

Can this be the real reason?

A closer look into recent events

Where                Impact                                        Reason given
Delta airlines       Reported damages reach $150M (est.)           Power switch failed
Worldpay             A million Etsy transactions affected (est.)   Server update
Southwest Airlines   2300 flights canceled                         Network router failed
Deutsche Bank        Unable to report swap data for five days      ???

“The issue arose from a hardware failure in the main database of the equities’ trading system”

Why does it keep happening?

Hint: Not [necessarily] for lack of trying…

Possible reasons

Design issues

Implementation issues

Testing issues

Measurement (of Quality and Risk)

• Setting up a resilient infrastructure is costly and complex
– Cloud orchestration, HA (virtual HA/FT, clustering, App LB), redundant / active-active storage, multi-pathing, teaming, Network LB, replication, Geo-HA (SRM, Stretch/Metro compute & storage), …

• The result: multiple technologies, vendors and teams

The resilient datacenter – blueprint

[Diagram: Site 1 … Site N, active-active / active-passive]

The challenge? Simple math…

[Diagram: Site 1 … Site N, active-active / active-passive. Configuration items shown: OS configuration, application configuration, HA configuration, SAN configuration, LB session configuration, VMware HA, VMware SRM, snapshot configuration, mirror / replica configuration, Puppet manifest]

• Some changes slip through the cracks…
• IT stability & quality cannot be fully tested following every change
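One way to picture how a change slips through the cracks is to compare what a node at Site 1 reports against its partner at Site N. The following is a minimal sketch, assuming the configuration has already been collected into flat dictionaries. The function name, keys and values are hypothetical and are not taken from the slides.

```python
def find_drift(primary: dict, standby: dict) -> dict:
    """Return {key: (primary_value, standby_value)} for every mismatch.

    A key missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(set(primary) | set(standby)):
        a = primary.get(key)
        b = standby.get(key)
        if a != b:
            drift[key] = (a, b)
    return drift


# Hypothetical collected settings for one active / passive pair.
site_1 = {"os.kernel": "3.10.0-327", "san.paths": 4, "ha.heartbeat_s": 5}
site_n = {"os.kernel": "3.10.0-327", "san.paths": 2}

for key, (a, b) in find_drift(site_1, site_n).items():
    print(f"{key}: Site 1={a!r}, Site N={b!r}")
```

A comparison like this only covers one layer at a time. It would need to be repeated for every configuration item in the diagram, which is where the numbers grow quickly.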

Every day that goes by…

[Chart: Risk over Time; marker: New build / Test / Audit]

If disaster strikes tomorrow…

How confident are you that your IT will recover smoothly?

How to prevent your next outage?

Lessons learned from the best run shops

1 Design right
• Manage knowledge
• Rely on community (vendors, other users)

2 Implement right
• Test quality immediately

3 Make sure it stays that way (see next slides!)

Put quality control at the center
Transforms IT operations from reactive to proactive

Making sure your environment stays ready

Transform IT operations from this:

… to that

Must be automated!

[Chart: Risk over Time; markers: New build / Test / Audit, Daily validation]

About quality & risk automation

» Stamina & focus more important than pace
• Each small addition goes a long way
• Start with your:
– Existing manual checklists
– Most recurring issues
– Any newly discovered (& significant) risk

» Ways to automate & “shortcuts”
• Use vendor scripts & built-in validation tools
• Create your own scripts
• Use automation / configuration management tools

Fast ROI – quickly frees time to fuel your journey

Limitation: cross-domain issues will not be caught!
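Turning an existing manual checklist into a script can start very small. The sketch below assumes the facts about a server have already been gathered into a dictionary by some other means. The `Check` class, the check names and the thresholds are hypothetical illustrations, not part of the talk or of any vendor tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    run: Callable[[dict], bool]  # returns True when the check passes


def run_checks(checks, facts):
    """Run every check against the collected facts.

    A check that raises (for example because a fact was never
    collected) is counted as failed rather than aborting the run.
    """
    results = {}
    for check in checks:
        try:
            results[check.name] = bool(check.run(facts))
        except Exception:
            results[check.name] = False
    return results


# Hypothetical items lifted from a manual checklist.
CHECKS = [
    Check("multipath: at least 2 SAN paths", lambda f: f["san_paths"] >= 2),
    Check("replica lag under 60 seconds", lambda f: f["replica_lag_s"] < 60),
    Check("NIC teaming enabled", lambda f: f["nic_teaming"]),
]

facts = {"san_paths": 1, "replica_lag_s": 12}  # nic_teaming was never collected
results = run_checks(CHECKS, facts)
for name, passed in results.items():
    print("PASS" if passed else "FAIL", name)
```

Each check here looks at a single server in isolation, so it shares the limitation noted above: a problem that only shows up across domains would not be caught.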

Tracking & enumeration

» Record results over time
• Create score-cards for servers / objects
• Will allow comparing status before and after a change, and examine trends

» Mining the data enables numerous benefits
• Understand what works and what does not
• Benchmark your vendors
• Make it part of your decision making process
• …
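A score-card can be as simple as the share of checks that pass on a given day, kept alongside the raw results so that two dates can be compared. The following is a rough sketch under that assumption. The scoring formula, the check names and the dates are illustrative choices, not something the slides prescribe.

```python
from datetime import date


def score(results: dict) -> float:
    """Percentage of passed checks, from 0 to 100."""
    if not results:
        return 0.0
    return 100.0 * sum(results.values()) / len(results)


def regressions(before: dict, after: dict) -> list:
    """Checks that passed before a change and fail (or are missing) after it."""
    return sorted(
        name for name, ok in before.items() if ok and not after.get(name, False)
    )


# Hypothetical daily results for one server, before and after a change.
history = [
    (date(2016, 9, 1), {"multipath": True, "replica_lag": True, "teaming": True, "snapshot": False}),
    (date(2016, 9, 2), {"multipath": False, "replica_lag": True, "teaming": True, "snapshot": False}),
]

for day, results in history:
    print(day.isoformat(), f"{score(results):.0f}%")
print("regressed:", regressions(history[0][1], history[1][1]))
```

Keeping the per-check results, and not only the score, is what makes the before / after comparison possible.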

Case study – large bank

» Automating checks in design labs, pre-prod and prod

» Store results in a repository correlating:
• What has changed
• Score cards
• Open issues & trending
• Business dependencies

» Customized feeds for:
• IT teams
• Business owners

Case study – results

» 90% reduction in downtime

» 70% reduction in firefighting costs

» Dramatic improvement in predictability & confidence

• Tests and actual workload shifts work

• When gaps do exist, it’s easy to understand what to do

» Better collaboration

• From finger-pointing to constant improvement

» Significant return on investment

• New use-cases uncovered regularly

The right program – key to constant improvement

» Change control process:
• Better decision making through quality & risk measurement
• Shorter cycles

» Stay ahead of trouble
• Automate
• Handle violations immediately

» Cross-team collaboration & visibility
• Single pane of glass for IT configuration quality, health & risk

About us

Helping many of the world’s largest enterprises prevent outages and data loss in their critical IT infrastructure.

Our technology & services

Early detection of availability risks and single-points-of-failure

Actionable alerts to relevant teams

Automated, cross-layer configuration validation

Easy measurement and visualization of resiliency metrics

AvailabilityGuard Services

Resilience health checks, best-practice validation

Managed Service Availability Assurance

To learn more

» Come talk to us....

» One-time health check

Thank you! (Questions?)

app.continuitysoftware.com
