Doron Pinhas, CTO - wsta.org · Doron Pinhas, CTO. Or it can look like this: ... Delta airlines...

Anatomy of an IT Outage (and ways to prevent them)

IT on Wall Street Seminar, Sep 22, 2016

Doron Pinhas, CTO

Or it can look like this:

“System outage grounds 2300 Delta flights”

» Always unexpected

» Usually takes too long to resume service

» Way too often…

• Root cause remains unknown

• Same problems reoccur

Anatomy of an IT outage

» Technology evolution and environment complexity don’t make it any easier

*Based on Continuity Software annual surveys

Are we getting better?

When was your last downtime event?

Can this be the real reason?

A closer look into recent events

Where | Impact | Reason given

Delta airlines | Reported damages reach $150M (est.) | Power switch failed

Worldpay | A million Etsy transactions affected (est.) | Server update

Southwest Airlines | 2300 flights canceled | Network router failed

Deutsche Bank | Unable to report swap data for five days | “The issue arose from a hardware failure in the main database of the equities’ trading system”

Why does it keep happening?

Hint: Not [necessarily] for lack of trying…

Possible reasons

Design issues

Implementation issues

Testing issues

Measurement (of Quality and Risk)

• Setting up a resilient infrastructure is costly and complex

– Cloud orchestration, HA (virtual HA/FT, clustering, App LB), redundant / active-active storage, multi-pathing, teaming, Network LB, replication, Geo-HA (SRM, Stretch/Metro compute & storage), …

• The result: multiple technologies, vendors and teams

The resilient datacenter – blueprint

Site 1 Site N

Active-active / Active-passive

Site 1 Site N

The challenge? Simple math…

OS Configuration

Application Configuration

HA Configuration

SAN Configuration

LB session configuration

VMware HA

VMware SRM

Snapshot Configuration

Mirror / Replica Configuration

Puppet Manifest

• Some changes slip through the cracks…

• IT stability & quality cannot be fully tested following every change

Active-active / Active-passive

New build / Test / Audit

Every day that goes by…

If disaster strikes tomorrow…

How confident are you that your IT will recover smoothly?

How to prevent your next outage?

Lessons learned from the best run shops

1 Design right

• Manage knowledge

• Rely on community (vendors, other users)

2 Implement right

• Test quality immediately

3 Make sure it stays that way (see next slides!)

Put quality control at the center

Transforms IT operations from reactive to proactive

Making sure your environment stays ready

Transform IT operations from this:

… to that

Must be automated!

New build / Test / Audit

Daily validation

About quality & risk automation

» Stamina & focus more important than pace

• Each small addition goes a long way

• Start with your:

– Existing manual checklists

– Most recurring issues

– Any newly discovered (& significant) risk

» Ways to automate & “shortcuts”

• Use vendor scripts & built-in validation tools

• Create your own scripts

• Use automation / configuration management tools

Fast ROI – quickly frees time to fuel your journey

Limitation: cross-domain issues will not be caught!
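As an illustration of turning an existing manual checklist into a script, here is a minimal Python sketch. The check names, the configuration keys (`storage_paths`, `nic_teaming`, `replica_target`) and the sample server are hypothetical, invented for this example; a real shop would read these values from its own inventory or vendor tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    """One checklist item: a name and a rule applied to a config dictionary."""
    name: str
    passed: Callable[[Dict[str, object]], bool]


# Hypothetical checklist items; replace with the rules from your own checklist.
CHECKS: List[Check] = [
    Check("at least two storage paths", lambda cfg: int(cfg.get("storage_paths", 0)) >= 2),
    Check("NIC teaming enabled", lambda cfg: cfg.get("nic_teaming") is True),
    Check("replication target configured", lambda cfg: bool(cfg.get("replica_target"))),
]


def run_checks(config: Dict[str, object], checks: List[Check] = CHECKS) -> List[str]:
    """Return the names of the checks that the given configuration fails."""
    return [check.name for check in checks if not check.passed(config)]


# Example: a (made-up) server with a single storage path.
server = {"storage_paths": 1, "nic_teaming": True, "replica_target": "site-n"}
for failure in run_checks(server):
    print("FAILED:", failure)
```

Each newly discovered risk becomes one more `Check` entry, which fits the "each small addition goes a long way" approach. Because every rule here looks at a single object's configuration, the sketch shares the limitation noted above: it will not catch cross-domain issues.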

Tracking & enumeration

» Record results over time

• Create score-cards for servers / objects

• Will allow comparing status before and after a change, and examine trends

» Mining the data enables numerous benefits

• Understand what works and what does not

• Benchmark your vendors

• Make it part of your decision-making process

• …
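One possible way to compare status before and after a change is sketched below in Python. The score-card shape (a mapping from check name to pass/fail) and the check names are assumptions made for this example, not a format prescribed by the talk.

```python
from typing import Dict, List

# A score-card maps a check name to whether the object passed it.
Scorecard = Dict[str, bool]


def score(card: Scorecard) -> float:
    """Fraction of checks passed; an empty card counts as fully passing."""
    if not card:
        return 1.0
    return sum(card.values()) / len(card)


def regressions(before: Scorecard, after: Scorecard) -> List[str]:
    """Checks that passed before the change and explicitly fail after it."""
    return sorted(
        name for name, ok in before.items()
        if ok and after.get(name) is False
    )


# Hypothetical score-cards for one server, recorded around a change.
before = {"multipath": True, "teaming": True, "replication": False}
after = {"multipath": False, "teaming": True, "replication": False}

print("score before:", round(score(before), 2))
print("score after:", round(score(after), 2))
print("regressions:", regressions(before, after))
```

Storing one such score-card per object per day is enough to examine trends later; the regression list points directly at what a given change broke.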

Case study – large bank

» Automating checks in design labs, pre-prod and prod

» Store results in a repository correlating:

• What has changed

• Score cards

• Open issues & trending

• Business dependencies

» Customized feeds for:

• IT teams

• Business owners
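The idea of one repository feeding different audiences can be sketched as follows. This is not the bank's actual implementation: the `Finding` fields, team names and service names are hypothetical and chosen only to show how the same records can be filtered per IT team or per business owner.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    """One open issue, tagged with its owning team and business dependency."""
    object_name: str       # server, cluster or other checked object
    issue: str             # description of the open issue
    team: str              # IT team responsible for the object
    business_service: str  # business service that depends on the object


def feed_for_team(findings: List[Finding], team: str) -> List[Finding]:
    """Findings that a given IT team needs to act on."""
    return [f for f in findings if f.team == team]


def feed_for_business(findings: List[Finding], service: str) -> List[Finding]:
    """Findings that put a given business service at risk."""
    return [f for f in findings if f.business_service == service]


# Hypothetical repository content.
repository = [
    Finding("db-01", "single storage path", "storage", "equities trading"),
    Finding("web-07", "not covered by load balancer", "network", "online banking"),
    Finding("db-02", "replica out of sync", "storage", "online banking"),
]

for finding in feed_for_team(repository, "storage"):
    print(finding.object_name, "-", finding.issue)
```

Tagging each record once with both a team and a business dependency is what lets the same data serve both kinds of feed.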

Case study – results

» 90% reduction in downtime

» 70% reduction in firefighting costs

» Dramatic improvement in predictability & confidence

• Tests and actual workload shifts work

• When gaps do exist, it’s easy to understand what to do

» Better collaboration

• From finger-pointing to constant improvement

» Significant return on investment

• New use-cases uncovered regularly

The right program – key to constant improvement

» Change control process:

• Better decision making through quality & risk measurement

• Shorter cycles

» Stay ahead of trouble

• Automate

• Handle violations immediately

» Cross-team collaboration & visibility

• Single pane of glass for IT configuration quality, health & risk

About us

Helping many of the world’s largest enterprises prevent outages and data loss in their critical IT infrastructure.

Our technology & services

Early detection of availability risks and single-points-of-failure

Actionable alerts to relevant teams

Automated, cross-layer configuration validation

Easy measurement and visualization of resiliency metrics

AvailabilityGuard Services

Resilience health checks, best-practice validation

Managed Service Availability Assurance

To learn more

» Come talk to us....

» One-time health check

Thank you! (Questions?)

app.continuitysoftware.com
