Breaking Azure for Fun and Profit

Pavel Michailov, Identity Division

Service Challenges

Cloud Services - Resilience

▪ Not a solved problem

▪ Goal is:
  ▪ 100% uptime
  ▪ No degradation
  ▪ Responsive

Cloud Services - Deployment

Cloud Services – Testing challenges

▪ Continuous evolution

▪ Multiple dependencies

▪ Global distribution

▪ Traffic fluctuation

Fault Injection System

▪ Inject faults in deployed service

▪ Verify correct service response

▪ Overcome limitations of traditional testing
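The inject-then-verify cycle described above can be sketched as follows. This is a minimal illustration, not the system presented in the talk: `FakeService`, `injected`, and `run_experiment` are hypothetical names, and the "fault" is a flag flipped in process rather than anything applied to a deployed service.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def injected(inject: Callable[[], None], remove: Callable[[], None]) -> Iterator[None]:
    """Apply a fault for the duration of a `with` block, then always remove it."""
    inject()
    try:
        yield
    finally:
        remove()  # runs even if the verification step raises


class FakeService:
    """Hypothetical stand-in for a deployed service with one dependency."""

    def __init__(self) -> None:
        self.dependency_up = True

    def handle_request(self) -> str:
        # Assumed resilient behaviour: degrade gracefully instead of failing.
        return "ok" if self.dependency_up else "served-from-cache"


def run_experiment(service: FakeService) -> bool:
    """Inject a dependency outage and check that the service still answers."""

    def take_down() -> None:
        service.dependency_up = False

    def bring_up() -> None:
        service.dependency_up = True

    with injected(take_down, bring_up):
        response = service.handle_request()
    return response == "served-from-cache"
```

Wrapping the fault in a context manager keeps removal tied to injection, so a failed verification cannot leave the fault in place.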

Agenda

System Overview

Applications

System Architecture

[Architecture diagram. Components shown: Fault Management Service, Cloud Management Service, Target Service VMs, Fault Agent]

Faults

▪ Resource pressure

▪ Network

▪ Processes

▪ Virtual machine

▪ Application specific

▪ Custom

Resource Pressure Faults

▪ CPU

▪ Memory

▪ Hard disk
  ▪ Capacity
  ▪ Read
  ▪ Write
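As a rough illustration of CPU and memory pressure, the sketch below holds a fixed amount of memory and busy-loops one core for a bounded time. `MemoryPressure` and `burn_cpu` are hypothetical helpers, not part of the system described here; a real agent would also need to cover disk capacity and I/O and apply limits per machine.

```python
import time

MIB = 1024 * 1024


class MemoryPressure:
    """Hold roughly `megabytes` MiB of memory until the `with` block exits."""

    def __init__(self, megabytes: int) -> None:
        self.megabytes = megabytes
        self._blocks = []

    def __enter__(self) -> "MemoryPressure":
        # Non-zero fill so the pages are actually written, not lazily mapped.
        self._blocks = [bytearray(b"\x01" * MIB) for _ in range(self.megabytes)]
        return self

    def held_bytes(self) -> int:
        return sum(len(block) for block in self._blocks)

    def __exit__(self, exc_type, exc, tb) -> None:
        self._blocks.clear()  # drop references so the memory can be reclaimed


def burn_cpu(seconds: float) -> int:
    """Busy-loop on one core for about `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations
```

Both helpers are bounded (a fixed size, a fixed duration), which keeps an experiment from outliving its intent.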

Network faults

▪ Types
  ▪ Disconnect
  ▪ Latency

▪ Filters
  ▪ Domain / IP / Subnet
  ▪ Port
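One way to picture how a fault type combines with filters is a small rule object that decides what happens to a given connection. The `NetworkFault` class and `apply_fault` function below are hypothetical; they only model the decision. Domain filters are left out because they would first require DNS resolution.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetworkFault:
    """Hypothetical description of a network fault and the traffic it targets."""

    kind: str                     # "disconnect" or "latency"
    subnet: Optional[str] = None  # CIDR or single IP; None matches any address
    port: Optional[int] = None    # None matches any port
    latency_ms: int = 0           # only meaningful when kind == "latency"

    def matches(self, dest_ip: str, dest_port: int) -> bool:
        if self.subnet is not None:
            network = ipaddress.ip_network(self.subnet, strict=False)
            if ipaddress.ip_address(dest_ip) not in network:
                return False
        if self.port is not None and dest_port != self.port:
            return False
        return True


def apply_fault(fault: NetworkFault, dest_ip: str, dest_port: int) -> str:
    """Return the action taken for one outgoing connection."""
    if not fault.matches(dest_ip, dest_port):
        return "pass"
    if fault.kind == "disconnect":
        return "drop"
    if fault.kind == "latency":
        return f"delay {fault.latency_ms}ms"
    raise ValueError(f"unknown fault kind: {fault.kind}")
```

For example, a latency fault scoped to `10.0.0.0/24` on port 443 delays only matching traffic and passes everything else untouched.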

Process faults

▪ Stop / Kill

▪ Restart

▪ Crash

▪ Hang
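On a POSIX system, some of these process faults can be approximated with signals. The sketch below is an assumption-laden illustration, not the agent's mechanism: it targets a throwaway child process, uses `SIGSTOP`/`SIGCONT` to simulate a hang, and uses `SIGKILL` for a kill. It will not work as written on Windows.

```python
import signal
import subprocess
import sys


def spawn_target() -> subprocess.Popen:
    """Start a throwaway child process that stands in for a service process."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def hang(proc: subprocess.Popen) -> None:
    """Simulate a hang: the process stays alive but stops being scheduled."""
    proc.send_signal(signal.SIGSTOP)


def resume(proc: subprocess.Popen) -> None:
    """Undo `hang`."""
    proc.send_signal(signal.SIGCONT)


def kill(proc: subprocess.Popen) -> int:
    """Kill the process without giving it a chance to clean up."""
    proc.kill()  # SIGKILL on POSIX
    return proc.wait(timeout=10)
```

On POSIX, `Popen.wait` reports a signal-terminated child as a negative return code, so the result of `kill` can be compared against `-signal.SIGKILL`.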

Virtual Machine / OS faults

▪ Stop

▪ Restart

▪ Re-image

▪ Machine Hang

▪ Change date

Application specific faults

▪ Hooks
  ▪ Instrument service code

▪ Intercept / Re-route calls
  ▪ No access to service code
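The two approaches above differ in who owns the code being changed. The sketch below contrasts them in a single process. All names (`fault_point`, `arm`, `Rerouted`, `StorageClient`) are hypothetical, and attribute replacement here only stands in for whatever native call interception a real agent would use.

```python
from typing import Any, Dict

# Hypothetical in-process registry: hook name -> exception to raise.
_ACTIVE: Dict[str, Exception] = {}


def fault_point(name: str) -> None:
    """Hook placed in service code; raises if a fault is armed for `name`."""
    exc = _ACTIVE.get(name)
    if exc is not None:
        raise exc


def arm(name: str, exc: Exception) -> None:
    _ACTIVE[name] = exc


def disarm(name: str) -> None:
    _ACTIVE.pop(name, None)


# Approach 1: hooks. The service code is instrumented by its owners.
def read_profile(user_id: str) -> str:
    fault_point("profile_store.read")
    return f"profile:{user_id}"


# Approach 2: interception. The target code is left untouched.
class StorageClient:
    """Hypothetical client the service calls but does not own."""

    def fetch(self, key: str) -> str:
        return f"value:{key}"


def failing_fetch(self: StorageClient, key: str) -> str:
    raise TimeoutError(f"injected timeout for {key}")


class Rerouted:
    """Temporarily replace `obj.attr` with `replacement`; restore on exit."""

    def __init__(self, obj: Any, attr: str, replacement: Any) -> None:
        self._obj, self._attr, self._replacement = obj, attr, replacement

    def __enter__(self) -> "Rerouted":
        self._original = getattr(self._obj, self._attr)
        setattr(self._obj, self._attr, self._replacement)
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        setattr(self._obj, self._attr, self._original)
```

Hooks give precise, named fault points but require changes to the service; interception needs no cooperation from the service at the cost of working only at call boundaries.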

Custom Faults

▪ Support for custom code execution

▪ Partner teams contribute as needed

▪ Faults subject to security review

Injection mechanism

▪ VM External

▪ VM internal, external to service code: Agent

▪ VM internal, internal to service code: Hooks

External injection

▪ VM / Region Stop

▪ VM / Region Restart

▪ Re-image

[Diagram labels: Cloud Management Service, Target VM]

VM internal injection - Agent

▪ Resource pressure

▪ Network

▪ Processes

▪ OS

▪ Detours

▪ …

[Diagram labels: Target Service VM, Target Application, Virtual Machine, Operating System, Fault Agent]

VM internal injection - Hooks

▪ Application behavior

▪ Flexibility

▪ Service specific

[Diagram label: Target Application]

Security and Safety

▪ Azure AD Integration

▪ Granular access control

▪ Secure communication

▪ Kill-switch/automated removal
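The kill-switch and automated removal idea can be sketched as a fault wrapper that removes itself after a maximum lifetime or on demand, whichever comes first. `GuardedFault` is a hypothetical name, and the sketch assumes the inject and remove actions are plain callables; it says nothing about how the real system enforces removal.

```python
import threading
from typing import Callable


class GuardedFault:
    """Fault that is removed after `max_seconds` or when `kill()` is called."""

    def __init__(
        self,
        inject: Callable[[], None],
        remove: Callable[[], None],
        max_seconds: float,
    ) -> None:
        self._inject = inject
        self._remove = remove
        self._lock = threading.Lock()
        self._active = False
        self._timer = threading.Timer(max_seconds, self.kill)
        self._timer.daemon = True

    @property
    def active(self) -> bool:
        return self._active

    def start(self) -> None:
        with self._lock:
            self._inject()
            self._active = True
        self._timer.start()  # automated removal if nobody calls kill()

    def kill(self) -> None:
        """Idempotent removal; safe to call from the timer or an operator."""
        with self._lock:
            if not self._active:
                return
            self._active = False
            self._remove()
        self._timer.cancel()

    def wait(self, timeout: float) -> None:
        """Block until the removal timer has finished, or `timeout` elapses."""
        self._timer.join(timeout)
```

Making `kill()` idempotent means the operator path and the timer path can race without removing the fault twice.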

Applications

▪ Resilience verification

▪ Test new features

▪ Training

Resilience Verification

Automated Regression Testing

▪ Scheduled periodic test runs

▪ Verify alert generation

▪ Verify telemetry and service behavior

Scheduled Runs

Verify Alert Generation

▪ Integration with internal alerting system

▪ Configurable time window, expected field values

▪ Incident auto-mitigation/resolution
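Checking for an expected alert within a configurable time window and with expected field values might look like the sketch below. The `Alert` shape and the `find_expected_alert` function are assumptions for illustration; the internal alerting system's actual schema and API are not described in the slides.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional


@dataclass
class Alert:
    """Hypothetical alert record: a timestamp plus free-form string fields."""

    raised_at: datetime
    fields: Dict[str, str] = field(default_factory=dict)


def find_expected_alert(
    alerts: Iterable[Alert],
    injected_at: datetime,
    window: timedelta,
    expected_fields: Dict[str, str],
) -> Optional[Alert]:
    """Return the first alert raised inside the window whose fields all match."""
    deadline = injected_at + window
    for alert in alerts:
        in_window = injected_at <= alert.raised_at <= deadline
        fields_match = all(
            alert.fields.get(key) == value for key, value in expected_fields.items()
        )
        if in_window and fields_match:
            return alert
    return None
```

A regression run would fail the test when this returns `None`, which covers both a missing alert and an alert with the wrong severity or component.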

Verify Service Behavior

Security Verification

▪ Custom Faults
  ▪ Local User Creation
  ▪ Malware upload – EICAR test file

▪ Verify security alerting

New Feature Verification

▪ Fill gap in testing frameworks

▪ Manual injection of relevant faults

▪ Existing regression tests catch edge-cases

Challenges – Moving Parts

▪ Multiple unmocked components

▪ Complex scenarios difficult to verify reliably

▪ Time consuming

Challenges – Adoption

▪ Full benefit only when applied across stack

▪ Non-functional testing often deprioritized

▪ Multi-team coordination difficult

Recovery Games

Recovery Games - Planning

▪ Attacker prepares weekly fault

▪ Identify area of interest

▪ Develop and test fault

Recovery Games – During the Game

▪ Attacker injects fault, provides hints

▪ Defender assesses impact

▪ Defender provides mitigation plan

▪ Senior team members and managers observe

Recovery Games - Goals

▪ Familiarize with monitoring tools

▪ Recognize outage patterns

▪ Train on assessing the impact

▪ Root-cause / mitigation mindset

▪ Practice log analysis

Recovery Games – Issue Discovery
