Cloud Resilience Fault Injection for Increased Resilience
Jorge Cardoso (jorge.cardoso@huawei.com) Huawei European Research Center
Riesstraße 25, 80992 München
The Butterfly Effect Project
OpenStack Munich - Cloud Resilience & Experiences with OpenStack Wednesday, April 13, 2016 6:30 PM
1
FusionSphere from Huawei
#6
2
News from OpenStack
06 April 2016
3
FAILURES ARE INEVITABLE! THE BEST WE CAN DO IS BE PREPARED FOR THEM AND LEARN FROM THEM
TEST, REPAIR, LEARN & PREDICT!
4
Unplanned downtime is caused by*
• software bugs: 27%
• hardware: 23%
• human error: 18%
• network failures: 17%
• natural disasters: 8%
*Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
5
Google's 2007 study found the following annualized failure rates (AFRs) for drives:
• 1 year old: 1.7%
• 3 years old: >8.6%
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proceedings of the 5th USENIX conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
6
Why does using a cloud infrastructure require advanced approaches for resiliency?
One reason [Netflix]: the lack of control over the underlying hardware and the inability to configure it to try to ensure 100% uptime.
7
Technology Trends
[Figure: Google Trends comparison of the search terms "cloud availability" and "cloud failure"]
8
Netflix: Chaos Monkey
• Chaos Monkey: randomly terminates instances in a cluster
• Chaos Gorilla: simulates an Availability Zone becoming unavailable
• Chaos Kong: simulates the outage of an entire region
• Latency Monkey: introduces latency to network packets to simulate degradation of the EC2 network
• Janitor Monkey: cleans up unused resources
• Security Monkey: analyzes and notifies on security profile changes
AWS recently recommended that firms using its infrastructure test their resilience by using Chaos Monkey to induce failures.
9
Netflix: Chaos Monkey
• Fewer alerts for the ops team
• Amazon EC2 and Amazon RDS Service Disruption in the US East Region (April 29, 2011)
• September 20th, 2015: Amazon’s DynamoDB service experienced an availability issue in US-EAST-1
• Transfer traffic to east region
10
Amazon AWS: GameDay
• A program designed to increase resilience by purposely injecting major failures
• Discover flaws and subtle dependencies
“That seems totally bizarre on the face of it, but as you dig down, you end up finding some dependency no one knew about previously […] We’ve had situations where we brought down a network in, say, São Paulo, only to find that in doing so we broke our links in Mexico.”
11
Google: DiRT
Google DiRT (Disaster Recovery Test)
• Annual disaster recovery & testing exercise
• 8 years since inception
• Multi-day exercise triggering (controlled) failures in systems and processes
Premise
• 30-day incapacitation of headquarters following a disaster
• Other offices and facilities may be affected
When
• “Big disaster”: annually for 3-5 days
• Continuous testing: year-round
Who
• 100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities)
• Business units (Human Resources, Finance, Safety, Crisis response, etc.)
http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf
12
Goal
Butterfly Effect System: enables automatic testing and repair of OpenStack and cloud applications
[Diagram labels: Cloud Application, HUAWEI FusionSphere, Failure, Test, Repair]
The system works by intentionally injecting different failures, testing the ability to survive them, and learning how to predict and repair failures preemptively.
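The inject, test, and repair cycle on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Fault` class, the in-memory `cloud` dictionary, and the `set_service` helper are hypothetical and do not come from the Butterfly Effect system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a cloud under test: service name -> "is up" flag.
Cloud = Dict[str, bool]


@dataclass
class Fault:
    """A named fault with an inject action and a repair action (illustrative)."""
    name: str
    inject: Callable[[Cloud], None]
    repair: Callable[[Cloud], None]


def is_healthy(cloud: Cloud) -> bool:
    """Toy health test: the cloud survives if every service is up."""
    return all(cloud.values())


def run_cycle(cloud: Cloud, faults: List[Fault]) -> Dict[str, str]:
    """Inject each fault, test whether the cloud survived, and repair if not."""
    report = {}
    for fault in faults:
        fault.inject(cloud)
        if is_healthy(cloud):
            report[fault.name] = "survived"
        else:
            fault.repair(cloud)
            report[fault.name] = "repaired" if is_healthy(cloud) else "unrepaired"
    return report


def set_service(name: str, up: bool) -> Callable[[Cloud], None]:
    """Build an action that marks one service as up or down."""
    def action(cloud: Cloud) -> None:
        cloud[name] = up
    return action


cloud = {"keystone": True, "cinder-db": True}
faults = [
    Fault("kill cinder database",
          set_service("cinder-db", False), set_service("cinder-db", True)),
    Fault("no-op fault", lambda c: None, lambda c: None),
]
report = run_cycle(cloud, faults)
```

The report separates faults the system survived on its own from faults that needed a repair, which is the kind of record a later learning or prediction step would need.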
13
Use Case: OpenStack Resiliency
• Kill the cinder database (simulates an update failure)
• Introduce delay in messages (full-scale traffic shows where the real bottlenecks are)
• Operation error: OPENSTACK_KEYSTONE_URL = "http://%s:5000/v2.0" % OPENSTACK_HOST
• Operation error: in /etc/nova/nova.conf, delete auth_strategy=keystone
• Remove the driver to the HD or remove access to NFS (simulates a hardware failure)
Best way to avoid failure: fail constantly
The main testing framework of OpenStack is called Tempest, an open-source project with more than 2000 tests. It offers only black-box testing (the tests access only the public interfaces).
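The nova.conf operation error (deleting `auth_strategy=keystone`) could be injected and undone along the following lines. This is a minimal sketch that works on the file content as a string. The helper names `delete_option` and `restore_option` and the sample configuration text are assumptions; only the `auth_strategy=keystone` line comes from the slide.

```python
from typing import List, Tuple


def delete_option(config_text: str, key: str) -> Tuple[str, List[str]]:
    """Remove every `key=value` line for `key`; return new text and removed lines."""
    kept, removed = [], []
    for line in config_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if "=" in line and name == key:
            removed.append(line)   # remember the line so the fault can be undone
        else:
            kept.append(line)
    return "\n".join(kept) + "\n", removed


def restore_option(config_text: str, section: str, removed: List[str]) -> str:
    """Undo the fault by re-inserting the removed lines after the section header."""
    out = []
    for line in config_text.splitlines():
        out.append(line)
        if line.strip() == f"[{section}]":
            out.extend(removed)
    return "\n".join(out) + "\n"


# Sample content is made up for the example; a real nova.conf is much longer.
original = "[DEFAULT]\nauth_strategy=keystone\ndebug=False\n"
faulty, removed = delete_option(original, "auth_strategy")
repaired = restore_option(faulty, "DEFAULT", removed)
```

Keeping the removed lines next to the injected fault makes the repair step trivial. A real run would also have to write the file back and restart the affected service.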
14
Use Case 1: Increasing Reliability
Public Cloud
Damage Pattern
Butterfly Effect
Fix configurations Fix bugs Replace hardware Upgrade memory
Fault Type
15
Use Case 2: Run Book Automation (RBA)
Public Cloud
Incident Management
Is this really an incident?
Major Incident Procedure
Butterfly Effect
Fault Type
Damage Pattern
Recovery Script
16
Components of a Solution
MONITORING: Nagios, Zabbix, Cacti, StackTach, Synaps, Monasca
CONFIGURATION AUTOMATION: Ansible, CFEngine, Chef, Puppet, Salt, Heat
FAULT-INJECTION ENGINES: DestroyStack, FSaaS, ChaosMonkey, AnarchyApe
FAULT LIBRARIES AND PLANS: pyCallGraph, Intellect, RunDeck, Nose
DATA VISUALIZATION: Kibana, Graylog2, Grafana
DAMAGE DETECTION: Tempest, Nose
DATA STORAGE: ElasticSearch, OpenTSDB, Neo4J, Graphite, Cassandra, Redis
DATA AGGREGATION: Logstash, Collectd, Flume, Fluentd, Heka, Ceilometer
MANUAL REPAIR: Bash, Python, Chef, Puppet
AUTOMATED REPAIR: jCOLIBRI, myCBR, Puppet, Rundeck, (R)?ex, Chef
DATA PROCESSING: Hadoop, Pig, Hive, Spark, Storm
OPERATIONS ANALYTICS: Statsd, R, Pandas, Weka, Machine Learning
ALERTING: Errbit, Honeybadger, Nagios, Zabbix, OpenPager, Riemann
DATA SOURCE: Log files, Collectd Plg, FlumeNG, OpenStack Tbls, Zabbix Agt, Nagios Plg
DATA TRANSPORT: rsyslog, ZeroMQ
[Diagram: seven-step cycle: (1) Design & Deploy Test Infrastructure, (2) Design & Execute Fault-Injection Plan, (3) Monitoring Facilities, (4) Identify Damages, (5) Repair & Learn, (6) Predict Future Errors, (7) Automatic Repair]
17
Technological Overview
(1) Design & Deploy Test Environment
• Customizable, automated OpenStack deployment
• FusionServer RH2288 + VirtualBox + Vagrant + RDO
(2) Design & Execute Fault-Injection Plan
• Language = Python (no DSL yet)
• Fault Engine = based on BPM
• Fault Plan = Workflow paradigm
(3) Monitoring Facilities
• Monasca (from HP, Rackspace, IBM)
• Visualization with Grafana
(4) Damage Detection
• OpenStack Tempest
• 1200 tests (but only API testing :( )
(5) Repair & Learn …
(6) Predict Future Errors …
(7) Automated Repair …
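Item (2) says fault plans are written in plain Python and follow a workflow paradigm. A hedged sketch of what such a plan might look like: an ordered list of labeled steps run by a small engine that stops at the first failing step. The `run_plan` function, the step functions, and the `context` keys are all hypothetical, and the real engine is described only as "based on BPM".

```python
from typing import Callable, Dict, List, Tuple

# A step is a label plus an action that returns True on success (assumed shape).
Step = Tuple[str, Callable[[Dict], bool]]


def run_plan(steps: List[Step], context: Dict) -> List[Tuple[str, str]]:
    """Run the steps in order; stop at the first failing step and keep a trace."""
    trace = []
    for label, action in steps:
        ok = action(context)
        trace.append((label, "ok" if ok else "failed"))
        if not ok:
            break
    return trace


def inject_delay(ctx: Dict) -> bool:
    ctx["message_delay_ms"] = 500          # pretend to slow down the message bus
    return True


def detect_damage(ctx: Dict) -> bool:
    ctx["damaged"] = ctx["message_delay_ms"] > 100
    return True


def repair(ctx: Dict) -> bool:
    ctx["message_delay_ms"] = 0            # pretend to remove the injected delay
    ctx["damaged"] = False
    return True


plan = [
    ("inject message delay", inject_delay),
    ("detect damage", detect_damage),
    ("repair", repair),
]
context: Dict = {}
trace = run_plan(plan, context)
```

Because the plan is an ordinary Python list, steps can be reordered, reused across plans, or generated programmatically without a dedicated DSL.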
18
Design & Deploy Test Environment
• Customizable, automated OpenStack deployment
• FusionServer RH2288 + VirtualBox + Vagrant + RDO
• 2 hours to deploy an OpenStack infrastructure with 32 VMs
19
Faults to Inject
Disk temporarily unavailable
• unmount a disk
• wait for replicas to regenerate
• remount the disk with the data intact
• wait for replicas to regenerate
• the extra replicas from handoff nodes should get removed
Disk replacement
• unmount a disk
• wait for replicas to regenerate
• delete the disk and remount it
• wait for replicas to regenerate
• the extra replicas from handoff nodes should get removed
Expected failure
• damage three disks at the same time (more if the replica count is higher)
• check that the replicas didn’t regenerate even after some time period
• fail if the replicas regenerated
• this tests whether the tests themselves are correct
VM failures
• send a VM creation request
• find the compute node where the request was scheduled
• damage the compute server
• check whether the VM creation was re-scheduled to another node
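The "disk temporarily unavailable" and "expected failure" scenarios can be tried out against a toy model before touching real storage. The sketch below is an assumption-heavy illustration: `ToyStore` is a made-up in-memory model of one replicated object, and `wait_for` advances that model with a `tick` callback where a real test would sleep and poll the cluster.

```python
from typing import Callable, Iterable, Set


class ToyStore:
    """Hypothetical model: which disks hold a replica of a single object."""

    def __init__(self, disks: Iterable[str], replica_count: int = 3) -> None:
        ordered = sorted(disks)
        self.mounted: Set[str] = set(ordered)
        self.primary: Set[str] = set(ordered[:replica_count])
        self.holders: Set[str] = set(self.primary)
        self.replica_count = replica_count

    def live_replicas(self) -> int:
        return len(self.holders & self.mounted)

    def replicate_once(self) -> None:
        """One replication pass: add a handoff replica or drop an extra one."""
        live = self.live_replicas()
        if 0 < live < self.replica_count:       # a surviving copy is required
            spare = sorted(self.mounted - self.holders)
            if spare:
                self.holders.add(spare[0])
        elif live > self.replica_count:
            extra = sorted((self.holders & self.mounted) - self.primary)
            if extra:
                self.holders.discard(extra[0])


def wait_for(condition: Callable[[], bool], tick: Callable[[], None],
             max_ticks: int = 10) -> bool:
    """Advance the system with `tick` until `condition` holds or ticks run out."""
    for _ in range(max_ticks):
        if condition():
            return True
        tick()
    return condition()


# Scenario: disk temporarily unavailable.
store = ToyStore(["d1", "d2", "d3", "d4"])
store.mounted.discard("d1")                     # unmount a disk
regenerated = wait_for(lambda: store.live_replicas() == 3, store.replicate_once)
store.mounted.add("d1")                         # remount with the data intact
cleaned = wait_for(lambda: store.holders == store.primary, store.replicate_once)

# Scenario: expected failure, all three replica disks damaged at once.
broken = ToyStore(["d1", "d2", "d3", "d4"])
broken.mounted -= broken.primary
wrongly_regenerated = wait_for(lambda: broken.live_replicas() > 0,
                               broken.replicate_once)
```

In the expected-failure scenario, a value of `True` for `wrongly_regenerated` would mean the check itself is broken, which is the point of that test.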
20
Damage Detection
The main testing framework of OpenStack is called Tempest, an open-source project with more than 2000 tests. It offers only black-box testing (the tests access only the public interfaces).
Network tests
• create keypairs
• create security groups
• create networks
Compute tests
• create a keypair
• create a security group
• boot an instance
Swift tests
• create a volume
• get the volume
• delete the volume
Identity tests
…
Cinder tests
…
Glance tests
…
echo "$ tempest init cloud-01"
echo "$ cp tempest/etc/tempest.conf cloud-01/etc/"
echo "$ cd cloud-01"
echo "Next is the full test suite:"
echo "$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))'"
echo "Next is the minimum basic test:"
echo "$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest.scenario.test_minimum_basic)'"
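To see what the full-suite regular expression selects, it can be tried with Python's `re` module. This sketch assumes the test runner applies the pattern with search semantics against test IDs of the form `dotted.path[tag,tag]`. The test IDs below are made up for the example.

```python
import re

# Pattern copied from the full-suite command above.
FULL_SUITE = re.compile(r"(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))")

# Hypothetical test IDs; only the general "path[tags]" shape is assumed.
test_ids = [
    "tempest.api.compute.test_servers.ServersTest.test_create_server[id-1]",
    "tempest.scenario.test_minimum_basic.TestMinimumBasic.test_basic[compute,network]",
    "tempest.scenario.test_big.TestBig.test_many_volumes[slow,volume]",
    "tempest.tests.test_unit.TestUnit.test_helper",
]

selected = [t for t in test_ids if FULL_SUITE.search(t)]
for test_id in selected:
    print(test_id)
```

The negative lookahead rejects any ID whose bracketed tag list contains the word `slow`, and the anchored group keeps only IDs that start with `tempest.api` or `tempest.scenario`.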
21
Zabbix and ELK
22
Monasca Overview: Uses the Keystone OpenStack Identity Service for authentication,
authorization and multi-tenancy. Monasca integrates with several other
OpenStack services such as Heat for auto-scaling and Ceilometer for
monitoring OpenStack resources.
Apache Kafka: A high-throughput distributed messaging system. Kafka is a
central component in Monasca and provides the infrastructure for all internal
communications between components.
Apache Storm: A free and open source distributed realtime computation
system. Apache Storm is used in the Monasca Threshold Engine.
InfluxDB: An open-source distributed time series database with no external
dependencies. InfluxDB is one of the supported databases for storing metrics
and alarm history.
MySQL: MySQL is one of the supported databases for the Monasca Config
Database.
Grafana: An open source, feature rich metrics dashboard and graph editor.
Support for Monasca as a data source in Grafana has been added.
Anomaly Detection: Engine implements real-time streaming anomaly detection.
Two algorithms: Numenta Platform for Intelligent Computing (NuPIC) and
Kolmogorov-Smirnov (K-S) Two Sample Test. Uses Stacktach for realtime
streaming.
Performance: 3 HP ProLiant SL390s G7 servers + InfluxDB cluster = 25K-30K
metrics/sec; monasca-api > 150K metrics/sec for a 3-node cluster with load
balancing; for more performance, use the HP Vertica database.
See https://www.openstack.org/assets/presentation-media/Monasca-Deep-Dive-Paris-Summit.pdf
Grafana (compute_instance_create_time)
Anomaly Detection (cpu.user_perc)
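The Kolmogorov-Smirnov two-sample test mentioned under Anomaly Detection compares a reference window of a metric with a recent window. The pure-Python sketch below is not the Monasca engine. The function names and the sample `cpu.user_perc` values are assumptions, and the threshold uses the standard large-sample approximation, which is only rough for windows this small.

```python
import math
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    for x in sa + sb:
        cdf_a = bisect_right(sa, x) / len(sa)   # share of `a` values <= x
        cdf_b = bisect_right(sb, x) / len(sb)   # share of `b` values <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def is_anomalous(reference: Sequence[float], recent: Sequence[float],
                 alpha: float = 0.05) -> bool:
    """Flag `recent` if it differs from `reference` at significance level `alpha`."""
    n, m = len(reference), len(recent)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, recent) > critical


# Made-up cpu.user_perc samples: a baseline, a similar window, and a spike.
reference = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11]
recent_ok = [11, 12, 10, 13, 12, 11, 12, 10, 11, 13]
recent_bad = [55, 60, 58, 62, 57, 59, 61, 56, 60, 58]

normal_flag = is_anomalous(reference, recent_ok)
spike_flag = is_anomalous(reference, recent_bad)
```

The test makes no assumption about the shape of the metric's distribution, which suits metrics such as CPU percentages that are rarely normally distributed.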
23
Application Domains
24
Join the Cause!
Internship positions for MSc students
Fault injection, fault models, fault libraries, fault plans, break and rebuild systems all day long, …
OpenStack engineer positions
Rapid prototyping of cool ideas: propose it today, code it, and show it running in 3 months…
Innovative PoCs
Solving difficult challenges of real problems using quick and dirty prototyping
Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved.