Breaking Azure for Fun and Profit
Pavel Michailov
Identity Division
Service Challenges
Cloud Services - Resilience
▪ Not a solved problem
▪ Goal is:
  ▪ 100% uptime
  ▪ No degradation
  ▪ Responsive
Cloud Services - Deployment
Cloud Services – Testing challenges
▪ Continuous evolution
▪ Multiple dependencies
▪ Global distribution
▪ Traffic fluctuation
Fault Injection System
▪ Inject faults in deployed service
▪ Verify correct service response
▪ Overcome limitations of traditional testing
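The three bullets above amount to an inject, verify, remove cycle. The following is a minimal Python sketch of that cycle, not the system described in the talk: the `Fault` class, `run_fault_experiment`, and the toy "service" state are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Fault:
    """Hypothetical fault: a name plus callables that apply and undo it."""
    name: str
    inject: Callable[[], None]
    remove: Callable[[], None]


@dataclass
class FaultRunResult:
    fault: str
    healthy_during_fault: bool
    healthy_after_removal: bool


def run_fault_experiment(fault: Fault, probe: Callable[[], bool]) -> FaultRunResult:
    """Inject a fault, probe the service, and always remove the fault."""
    fault.inject()
    try:
        during = probe()
    finally:
        fault.remove()  # removal happens even if the probe raises
    after = probe()
    return FaultRunResult(fault.name, during, after)


# Toy stand-in for a service whose only dependency can be switched off.
state = {"dependency_up": True}
fault = Fault(
    name="dependency-down",
    inject=lambda: state.update(dependency_up=False),
    remove=lambda: state.update(dependency_up=True),
)
result = run_fault_experiment(fault, probe=lambda: state["dependency_up"])
```

In this toy case the probe simply mirrors the dependency, so the result records an unhealthy service during the fault and a healthy one after removal. A resilient service would be expected to stay healthy in both phases.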
Agenda
System Overview
Applications
System Architecture
[Diagram components: Target Service VMs, Fault Management Service, Fault Agent, Cloud Management Service]
Faults
▪ Resource pressure
▪ Network
▪ Processes
▪ Virtual machine
▪ Application specific
▪ Custom
Resource Pressure Faults
▪ CPU
▪ Memory
▪ Hard disk
  ▪ Capacity
  ▪ Read
  ▪ Write
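As a rough illustration of what a resource pressure fault can look like, here is a small Python sketch of memory and CPU pressure. The `MemoryPressure` class and `burn_cpu` function are hypothetical and are not taken from the talk; a production agent would need limits, monitoring, and a reliable removal path.

```python
import time


class MemoryPressure:
    """Hold roughly `megabytes` MiB of memory until released (illustrative only)."""

    def __init__(self, megabytes: int):
        self.megabytes = megabytes
        self._blocks = []

    def inject(self) -> None:
        # Allocate in 1 MiB blocks so a failure midway still leaves a list to free.
        # Note: zero-filled buffers may not be physically committed by the OS
        # until they are written to, so real pressure may require touching pages.
        for _ in range(self.megabytes):
            self._blocks.append(bytearray(1024 * 1024))

    def remove(self) -> None:
        self._blocks.clear()

    @property
    def held_bytes(self) -> int:
        return sum(len(block) for block in self._blocks)


def burn_cpu(seconds: float) -> int:
    """Spin one core for about `seconds`; returns the number of loop iterations."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations
```

Keeping `inject` and `remove` as separate methods makes it straightforward for a caller to guarantee cleanup, for example in a `try`/`finally` block.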
Network faults
▪ Types
  ▪ Disconnect
  ▪ Latency
▪ Filters
  ▪ Domain / IP / Subnet
  ▪ Port
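The combination of fault types and filters can be pictured as a rule that is checked against each connection. The sketch below is an assumption about how such matching might work, using Python's standard `ipaddress` module; the `NetworkFault` class and its field names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetworkFault:
    """Hypothetical rule: a fault type plus optional filters."""
    kind: str                      # "disconnect" or "latency"
    subnet: Optional[str] = None   # e.g. "10.1.0.0/16"; a bare IP is treated as /32
    domain: Optional[str] = None   # matches the host name and its subdomains
    port: Optional[int] = None
    latency_ms: int = 0            # only meaningful when kind == "latency"

    def matches(self, host: str, ip: str, port: int) -> bool:
        """True when every configured filter matches the connection."""
        if self.port is not None and port != self.port:
            return False
        if self.subnet is not None:
            if ipaddress.ip_address(ip) not in ipaddress.ip_network(self.subnet):
                return False
        if self.domain is not None:
            if host != self.domain and not host.endswith("." + self.domain):
                return False
        return True


# Example rule: add 250 ms latency to HTTPS traffic towards one subnet.
rule = NetworkFault(kind="latency", subnet="10.1.0.0/16", port=443, latency_ms=250)
```

Filters that are left unset match everything, so a rule with no filters would apply the fault to all traffic.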
Process faults
▪ Stop / Kill
▪ Restart
▪ Crash
▪ Hang
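For illustration, several of these process faults can be approximated with standard signals. The sketch below assumes a POSIX host (`SIGSTOP` and `SIGCONT` are not available on Windows) and uses a sleeping child process as a stand-in target; the function names are hypothetical and do not describe the agent from the talk.

```python
import signal
import subprocess
import sys


def start_target() -> subprocess.Popen:
    """Start a stand-in target process that just sleeps."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def hang(proc: subprocess.Popen) -> None:
    """Simulate a hang by suspending the process (POSIX only)."""
    proc.send_signal(signal.SIGSTOP)


def resume(proc: subprocess.Popen) -> None:
    """Undo a simulated hang."""
    proc.send_signal(signal.SIGCONT)


def stop(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Ask the process to exit (SIGTERM on POSIX) and wait for it."""
    proc.terminate()
    return proc.wait(timeout=timeout)


def kill(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Force the process to exit (SIGKILL on POSIX) and wait for it."""
    proc.kill()
    return proc.wait(timeout=timeout)
```

A restart fault would then be a `stop` or `kill` followed by `start_target`. On POSIX, the return code of a signalled process is the negated signal number.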
Virtual Machine / OS faults
▪ Stop
▪ Restart
▪ Re-image
▪ Machine Hang
▪ Change date
Application specific faults
▪ Hooks
  ▪ Instrument service code
▪ Intercept / Re-route calls
  ▪ No access to service code
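The hook approach, where service code is instrumented with named fault points, could look roughly like the following Python sketch. The `fault_hook` decorator, the registry, and the `token_store.read` hook name are all hypothetical illustrations rather than the mechanism used in the talk.

```python
import functools
from typing import Callable, Dict

# Hypothetical in-process registry: hook name -> action to run when the hook fires.
_active_faults: Dict[str, Callable[[], None]] = {}


def fault_hook(name: str):
    """Mark a function as a fault point; the active fault (if any) runs before it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            action = _active_faults.get(name)
            if action is not None:
                action()  # may sleep, raise, or otherwise alter behavior
            return func(*args, **kwargs)
        return wrapper
    return decorator


def enable_fault(name: str, action: Callable[[], None]) -> None:
    _active_faults[name] = action


def disable_fault(name: str) -> None:
    _active_faults.pop(name, None)


def raise_timeout() -> None:
    raise TimeoutError("injected fault: token store unavailable")


@fault_hook("token_store.read")
def read_token(user: str) -> str:
    """Stand-in for real service code."""
    return f"token-for-{user}"
```

With no fault enabled the hook is a dictionary lookup and the function behaves normally. After `enable_fault("token_store.read", raise_timeout)`, calls to `read_token` raise `TimeoutError` until `disable_fault` is called.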
Custom Faults
▪ Support for custom code execution
▪ Partner teams contribute as needed
▪ Faults subject to security review
Injection mechanism
▪ VM External
▪ VM internal, external to service code: Agent
▪ VM internal, internal to service code: Hooks
External injection
▪ VM / Region Stop
▪ VM / Region Restart
▪ Re-image
[Diagram components: Target VMs, Cloud Management Service]
VM internal injection - Agent
▪ Resource pressure
▪ Network
▪ Processes
▪ OS
▪ Detours
▪ …
[Diagram components: Target Service VM, Target Application, Virtual Machine, Operating System, Fault Agent]
VM internal injection - Hooks
▪ Application behavior
▪ Flexibility
▪ Service specific
[Diagram component: Target Application]
Security and Safety
▪ Azure AD Integration
▪ Granular access control
▪ Secure communication
▪ Kill-switch/automated removal
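One way to picture the kill-switch and automated removal bullet is a registry that tracks every active fault with a time-to-live. This is a hedged sketch under that assumption; `FaultRegistry` and its methods are hypothetical, and the talk does not describe how its safety mechanism is built.

```python
import time
from typing import Callable, Dict, List, Tuple


class FaultRegistry:
    """Hypothetical tracker that removes faults on expiry or via a kill switch."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._active: Dict[str, Tuple[float, Callable[[], None]]] = {}

    def register(self, fault_id: str, ttl_seconds: float,
                 remove: Callable[[], None]) -> None:
        """Record an injected fault together with the callable that undoes it."""
        self._active[fault_id] = (self._clock() + ttl_seconds, remove)

    def sweep(self) -> List[str]:
        """Remove every fault whose time-to-live has elapsed; return their ids."""
        now = self._clock()
        expired = [fid for fid, (deadline, _) in self._active.items()
                   if deadline <= now]
        for fid in expired:
            self._remove(fid)
        return expired

    def kill_switch(self) -> List[str]:
        """Remove all active faults immediately; return their ids."""
        ids = list(self._active)
        for fid in ids:
            self._remove(fid)
        return ids

    def _remove(self, fault_id: str) -> None:
        _, remove = self._active.pop(fault_id)
        remove()

    @property
    def active(self) -> List[str]:
        return sorted(self._active)
```

In this design `sweep` would be called periodically, while `kill_switch` is the manual override that clears everything regardless of remaining time.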
Applications
Resilience verification
Test new features
Training
Resilience Verification
Automated Regression Testing
▪ Scheduled periodic test runs
▪ Verify alert generation
▪ Verify telemetry and service behavior
Scheduled Runs
Verify Alert Generation
▪ Integration with internal alerting system
▪ Configurable time window, expected field values
▪ Incident auto-mitigation/resolution
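The time window and expected field values check can be sketched as a filter over alert records. The example below is illustrative only: the alert dictionaries, their field names (`raised_at`, `severity`, `component`), and the `find_matching_alerts` helper are assumptions, not the internal alerting system's API.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def find_matching_alerts(
    alerts: List[Dict[str, Any]],
    injected_at: datetime,
    window: timedelta,
    expected_fields: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Return alerts raised within `window` after injection whose fields match."""
    deadline = injected_at + window
    return [
        alert for alert in alerts
        if injected_at <= alert["raised_at"] <= deadline
        and all(alert.get(key) == value for key, value in expected_fields.items())
    ]


# Made-up alert records for illustration.
injected_at = datetime(2024, 1, 1, 12, 0, 0)
alerts = [
    {"raised_at": injected_at + timedelta(minutes=3), "severity": 2, "component": "auth"},
    {"raised_at": injected_at + timedelta(minutes=30), "severity": 2, "component": "auth"},
    {"raised_at": injected_at + timedelta(minutes=4), "severity": 4, "component": "auth"},
]
matches = find_matching_alerts(
    alerts, injected_at, timedelta(minutes=10), {"severity": 2, "component": "auth"}
)
```

A regression test would fail when `matches` is empty, meaning the injected fault did not produce the expected alert in time.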
Verify Service Behavior
Security Verification
▪ Custom Faults
  ▪ Local User Creation
  ▪ Malware upload – EICAR test file
▪ Verify security alerting
New Feature Verification
▪ Fill gap in testing frameworks
▪ Manual injection of relevant faults
▪ Existing regression tests catch edge-cases
Challenges – Moving Parts
▪ Multiple unmocked components
▪ Complex scenarios difficult to verify reliably
▪ Time consuming
Challenges – Adoption
▪ Full benefit only when applied across stack
▪ Non-functional testing often deprioritized
▪ Multi-team coordination difficult
Recovery Games
Recovery Games - Planning
▪ Attacker prepares weekly fault
▪ Identify area of interest
▪ Develop and test fault
Recovery Games – During the Game
▪ Attacker injects fault, provides hints
▪ Defender assesses impact
▪ Defender provides mitigation plan
▪ Senior team members and managers observe
Recovery Games - Goals
▪ Familiarize with monitoring tools
▪ Recognize outage patterns
▪ Train on assessing the impact
▪ Root-cause / mitigation mindset
▪ Practice log analysis
Recovery Games – Issue Discovery