Planning to Fail #phpne13

Planning to fail

@davegardnerisme #phpne13

the taxi app

Planning to fail

Planning for failure

Planning to fail

http://en.wikipedia.org/wiki/High_availability

99.9% (three nines)

Downtime:

43.8 minutes per month
8.76 hours per year

99.99% (four nines)

Downtime:

4.32 minutes per month
52.56 minutes per year

99.999% (five nines)

Downtime:

25.9 seconds per month
5.26 minutes per year
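
The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, using a hypothetical helper `allowedDowntimeSeconds()` that is not part of the talk. It assumes a 365-day year and a 30-day month; per-month figures shift slightly if a month is instead taken as one twelfth of a year.

```php
<?php
/**
 * Allowed downtime, in seconds, for a given availability percentage
 * over a period of the given length. Hypothetical helper for illustration.
 */
function allowedDowntimeSeconds(float $availabilityPercent, float $periodSeconds): float
{
    return (1 - $availabilityPercent / 100) * $periodSeconds;
}

$year = 365 * 24 * 3600;     // assumption: 365-day year
$month30 = 30 * 24 * 3600;   // assumption: 30-day month

foreach ([99.9, 99.99, 99.999] as $availability) {
    printf(
        "%s%%: %.2f minutes per year, %.2f seconds per 30-day month\n",
        $availability,
        allowedDowntimeSeconds($availability, $year) / 60,
        allowedDowntimeSeconds($availability, $month30)
    );
}
```

Each extra nine divides the allowed downtime by ten, which is why five nines leaves only seconds per month for any incident.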

www.whoownsmyavailability.com

The beginning

My website: single VPS running PHP + MySQL

No growth, low volume, simple functionality, one engineer (me!)

Large growth, high volume, complex functionality, lots of engineers

• Launched in London, November 2011

• Now in 5 cities in 3 countries (30%+ growth every month)

• A Hailo hail is accepted around the world every 5 seconds

“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”

http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf

• SOA (10+ services)

• AWS (3 regions, 9 AZs, lots of instances)

• 10+ engineers building services

and you? (Hailo is hiring)

Our overall reliability is in danger

Embracing failure

(a coping strategy)

VPC (running PHP+MySQL)

reliable?

Reliable !== Resilient

Choosing a stack

“Hailo” (running PHP+MySQL)

reliable?

Service

each service does one job well

Service Service Service

Service Oriented Architecture

• Fewer lines of code

• Fewer responsibilities

• Changes less frequently

• Can swap entire implementation if needed

Service (running PHP+MySQL)

reliable?

Service MySQL

MySQL running on different box

Service

MySQL running in Multi-Master mode

Going global

Separating concerns

• CRUD

• Locking

• Search

• Analytics

• ID generation

also queuing…

At Hailo we look for technologies that are:

• Distributed: run on more than one machine

• Homogeneous: all nodes look the same

• Resilient: can cope with the loss of node(s) with no loss of data

“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”

http://blog.b3k.us/2012/01/24/some-rules.html

• Highly performant, scalable and resilient data store

• Underpins much of what we do at Hailo

• Makes multi-DC easy!

Cassandra

• Highly reliable distributed coordination

• We implement locking and leadership election on top of ZK and use sparingly

ZooKeeper

• Distributed, RESTful, Search Engine built on top of Apache Lucene

• Replaced basic foo LIKE ‘%bar%’ queries (so much better)

Elasticsearch

• Realtime message processing system designed to handle billions of messages per day

• Fault tolerant, highly available with reliable message delivery guarantee

NSQ

• Real time incremental analytics platform, backed by Apache Cassandra

• Powerful SQL-like interface

• Scalable and highly available

Acunu Analytics

• Distributed ID generation with no coordination required

• Rock solid

Cruftflake

• All these technologies have similar properties of distribution and resilience

• They are designed to cope with failure

• They are not broken by design

Lessons learned

Minimise the critical path

What is the minimum viable service?

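class HailoMemcacheService
{
    private $mc = null;   // \Memcached client, created on first use
    private $servers;     // server list in the format addServers() expects

    public function __construct(array $servers)
    {
        $this->servers = $servers;
    }

    public function __call($name, $args)
    {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance()
    {
        // Create the client only when a caller first needs it
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            $this->mc->addServers($this->servers);
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use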

Configure clients carefully

$this->mc = new \Memcached;
$this->mc->addServers($s);

$this->mc->setOption(
    \Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
$this->mc->setOption(
    \Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);

Make sure timeouts are configured

Choose timeouts based on data

“Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”

http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html

95th percentile
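
One way to turn measured latencies into a timeout candidate is a nearest-rank percentile. The sketch below is an assumption about how that could be done, not the method used at Hailo: `percentile()` is a hypothetical helper and the sample latencies are made-up numbers.

```php
<?php
/**
 * Nearest-rank percentile of a list of samples.
 * Hypothetical helper; the talk does not say how its percentile was computed.
 */
function percentile(array $samples, float $p): float
{
    if (count($samples) === 0) {
        throw new InvalidArgumentException('no samples');
    }
    sort($samples);
    // Nearest-rank: the smallest sample with at least p% of all samples at or below it.
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[max($rank, 1) - 1];
}

// Illustrative latencies in milliseconds (made-up, not measurements from Hailo).
$latenciesMs = [2, 3, 3, 4, 4, 5, 5, 6, 8, 40];

printf("95th percentile: %.1f ms\n", percentile($latenciesMs, 95));
```

A percentile keeps the slow tail visible where a mean would hide it, so it is a more honest starting point for a timeout.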

• Kill memcache on box A, measure impact on application

• Kill memcache on box B, measure impact on application

All fine.. we’ve got this covered!

• Box A, running in AWS, locks up

• Any parts of application that touch Memcache stop working

Things fail in exotic ways

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT

$ php test-memcache.php

Working OK!

Packets rejected and source notified by ICMP. Expect fast fails.

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP

Working OK!

Packets silently dropped. Expect long time outs.

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP

Hangs! Uh oh.

• When AWS instances hang they appear to accept connections but drop packets

• Bug!

https://bugs.launchpad.net/libmemcached/+bug/583031

Fix, rinse, repeat

RabbitMQ RabbitMQ RabbitMQ

Service

AMQP (port 5672)

HA cluster

$ php test-rabbitmq.php

Fantastic! Block AMQP port, client times out

“RabbitMQ clusters do not tolerate network partitions well.”

http://www.rabbitmq.com/partitions.html

$ epmd -names
epmd: up and running on port 4369 with data:
name rabbit at port 60278

Each node listens on a port assigned by EPMD

$ php test-rabbitmq.php

Hangs! Uh oh.

Mnesia('rabbit@dmzutilities03-global01-test'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@dmzutilities01-global01-test'}

application: rabbitmq_management
exited: shutdown
type: temporary

RabbitMQ logs show a partitioned network error; nodes shut down

while ($read < $n
    && !feof($this->sock->real_sock())
    && (false !== ($buf = fread($this->sock->real_sock(), $n - $read)))
) {
    $read += strlen($buf);
    $res .= $buf;
}

PHP library didn’t have any time limit on reading a frame
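
A read loop like the one above can be bounded by an overall deadline. The following is a minimal sketch of that idea, not the fix that went into the library: `readWithDeadline()` is a hypothetical function, and it assumes the socket is an ordinary PHP stream resource that `stream_select()` can wait on.

```php
<?php
/**
 * Read exactly $n bytes from a stream, giving up once $timeoutSeconds
 * have elapsed in total. Hypothetical helper for illustration only.
 */
function readWithDeadline($stream, int $n, float $timeoutSeconds): string
{
    $deadline = microtime(true) + $timeoutSeconds;
    $res = '';

    while (strlen($res) < $n) {
        $remaining = $deadline - microtime(true);
        if ($remaining <= 0) {
            throw new RuntimeException('Timed out reading frame');
        }

        // Wait for data, but never longer than the time left before the deadline.
        $read = [$stream];
        $write = null;
        $except = null;
        $sec = (int) floor($remaining);
        $usec = (int) (($remaining - $sec) * 1000000);
        $ready = stream_select($read, $write, $except, $sec, $usec);

        if ($ready === false) {
            throw new RuntimeException('stream_select failed');
        }
        if ($ready === 0) {
            continue; // nothing arrived; the loop re-checks the deadline
        }

        $buf = fread($stream, $n - strlen($res));
        if ($buf === false || ($buf === '' && feof($stream))) {
            throw new RuntimeException('Connection closed while reading frame');
        }
        $res .= $buf;
    }

    return $res;
}
```

The deadline covers the whole frame rather than each individual read, so a peer that trickles bytes slowly still cannot hold the caller indefinitely.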

Fix, rinse, repeat

It would be nice if we could automate this

Automate!

• Hailo run a dedicated automated test environment

• Powered by bash, JMeter and Graphite

• Continuous automated testing with failure simulations

Fix attempt 1: bad timeouts configured

Fix attempt 2: better timeouts

Simulate in system tests

Simulate failure

Assert monitoring endpoint picks this up

Assert features still work
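
The three steps above can be sketched as a tiny self-contained test. Everything here is hypothetical and stands in for real infrastructure: `FakeCache` plays the failing dependency, `GreetingService` plays a service that keeps its cache off the critical path, and `healthcheck()` plays the monitoring endpoint.

```php
<?php
// Hypothetical stand-in for a cache client; "blocked" simulates the dependency failing.
class FakeCache
{
    public $blocked = false;
    private $data = [];

    public function get(string $key)
    {
        if ($this->blocked) {
            throw new RuntimeException('cache unreachable');
        }
        return $this->data[$key] ?? null;
    }
}

// Hypothetical service: a cache failure degrades it but does not break it.
class GreetingService
{
    private $cache;
    private $cacheHealthy = true;

    public function __construct(FakeCache $cache)
    {
        $this->cache = $cache;
    }

    public function greet(string $name): string
    {
        try {
            $cached = $this->cache->get("greeting:$name");
            $this->cacheHealthy = true;
            if ($cached !== null) {
                return $cached;
            }
        } catch (RuntimeException $e) {
            $this->cacheHealthy = false; // surfaced by healthcheck()
        }
        return "Hello, $name"; // fall back to computing the answer
    }

    public function healthcheck(): array
    {
        return ['cache' => $this->cacheHealthy ? 'ok' : 'failing'];
    }
}

$cache = new FakeCache();
$service = new GreetingService($cache);

// 1. Simulate failure.
$cache->blocked = true;
$result = $service->greet('world');

// 2. Assert the monitoring endpoint picks this up.
assert($service->healthcheck()['cache'] === 'failing');

// 3. Assert features still work.
assert($result === 'Hello, world');
```

In a real system test the simulated failure would be something like a firewall rule on the test environment instead of a flag on a fake object.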

In conclusion

“the best way to avoid failure is to fail constantly.”

http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html

You should test for failure

How does the software react?
How does the PHP client react?

Automation makes continuous failure testing feasible

Systems that cope well with failure are easier to operate

TIMED BLOCK ALL THE THINGS

Thanks

Software used at Hailo

http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp

Plus a load of other things I’ve not mentioned.
