Planning to Fail #phpne13


DESCRIPTION

Slides from my Planning to Fail talk given at PHP North East conference 2013. This is a slightly longer version of the same talk given at the PHP UK conference. The talk was on how you can build resilient systems by embracing failure.


Planning to fail

@davegardnerisme #phpne13

dave

the taxi app

Planning to fail

Planning for failure

Planning to fail

99.9% (three nines)

Downtime:

43.8 minutes per month
8.76 hours per year

99.99% (four nines)

Downtime:

4.32 minutes per month
52.56 minutes per year

99.999% (five nines)

Downtime:

25.9 seconds per month
5.26 minutes per year
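The downtime budgets above follow directly from the availability percentage. A minimal sketch of the arithmetic; the function name `downtimeSeconds` is made up for illustration. The monthly figure depends on the month length assumed: at three nines a 30-day month gives 43.2 minutes, while an average month (365/12 days) gives 43.8.

```php
<?php
// Allowed downtime, in seconds, for a given availability over a period.
function downtimeSeconds(float $availabilityPercent, float $periodDays): float
{
    $periodSeconds = $periodDays * 24 * 60 * 60;
    return $periodSeconds * (1 - $availabilityPercent / 100);
}

foreach ([99.9, 99.99, 99.999] as $nines) {
    printf(
        "%s%%: %.2f minutes per 30-day month, %.2f minutes per year\n",
        $nines,
        downtimeSeconds($nines, 30) / 60,
        downtimeSeconds($nines, 365) / 60
    );
}
```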

www.whoownsmyavailability.com

?

www.whoownsmyavailability.com

YOU

The beginning

<?php

My website: single VPS running PHP + MySQL

No growth, low volume, simple functionality, one engineer (me!)

Large growth, high volume, complex functionality, lots of engineers

• Launched in London, November 2011

• Now in 5 cities in 3 countries (30%+ growth every month)

• A Hailo hail is accepted around the world every 5 seconds

“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”

http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf

• SOA (10+ services)

• AWS (3 regions, 9 AZs, lots of instances)

• 10+ engineers building services

and you? (Hailo is hiring)

Our overall reliability is in

danger

Embracing failure

(a coping strategy)

VPS (running PHP+MySQL)

reliable?

Reliable !== Resilient

Choosing a stack

“Hailo” (running PHP+MySQL)

reliable?

Service

each service does one job well

Service Service Service

Service Oriented Architecture

• Fewer lines of code

• Fewer responsibilities

• Changes less frequently

• Can swap entire implementation if needed

Service (running PHP+MySQL)

reliable?

Service MySQL

MySQL running on different box

Service

MySQL

MySQL

MySQL running in Multi-Master mode

Going global

MySQL

Separating concerns

• CRUD

• Locking

• Search

• Analytics

• ID generation

also queuing…

At Hailo we look for technologies that are:

• Distributed: run on more than one machine

• Homogeneous: all nodes look the same

• Resilient: can cope with the loss of node(s) with no loss of data

“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”

http://blog.b3k.us/2012/01/24/some-rules.html

• Highly performant, scalable and resilient data store

• Underpins much of what we do at Hailo

• Makes multi-DC easy!

• Highly reliable distributed coordination

• We implement locking and leadership election on top of ZK and use sparingly

ZooKeeper

• Distributed, RESTful search engine built on top of Apache Lucene

• Replaced basic foo LIKE ‘%bar%’ queries (so much better)

Elasticsearch

• Realtime message processing system designed to handle billions of messages per day

• Fault tolerant, highly available with reliable message delivery guarantee

NSQ

• Real time incremental analytics platform, backed by Apache Cassandra

• Powerful SQL-like interface

• Scalable and highly available

• Distributed ID generation with no coordination required

• Rock solid

Cruftflake
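Coordination-free ID generation usually works by packing a timestamp, a per-machine identifier and a per-millisecond sequence number into a single integer, as in Twitter's Snowflake scheme. A rough sketch, assuming a Snowflake-style 41/10/12-bit layout and 64-bit PHP; this is not Cruftflake's code, and the epoch constant and function names are made up for illustration.

```php
<?php
// Arbitrary custom epoch (2013-01-01 00:00:00 UTC), in milliseconds.
const CUSTOM_EPOCH_MS = 1356998400000;

// Pack timestamp (41 bits), machine ID (10 bits) and sequence (12 bits).
function composeId(int $timestampMs, int $machineId, int $sequence): int
{
    if ($machineId < 0 || $machineId > 1023 || $sequence < 0 || $sequence > 4095) {
        throw new InvalidArgumentException('machineId must fit 10 bits, sequence 12 bits');
    }
    $elapsed = $timestampMs - CUSTOM_EPOCH_MS;
    if ($elapsed < 0 || $elapsed >= (1 << 41)) {
        throw new InvalidArgumentException('timestamp outside the 41-bit range');
    }
    return ($elapsed << 22) | ($machineId << 12) | $sequence;
}

// Reverse of composeId(), handy for debugging.
function decomposeId(int $id): array
{
    return [
        'timestampMs' => ($id >> 22) + CUSTOM_EPOCH_MS,
        'machineId'   => ($id >> 12) & 0x3FF,
        'sequence'    => $id & 0xFFF,
    ];
}

$nowMs = (int) floor(microtime(true) * 1000);
print_r(decomposeId(composeId($nowMs, 7, 0)));
```

Uniqueness in a scheme like this relies on each node having its own machine ID and on the sequence being incremented for IDs issued within the same millisecond; no node needs to talk to any other.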

• All these technologies have similar properties of distribution and resilience

• They are designed to cope with failure

• They are not broken by design

Lessons learned

Minimise the critical path

What is the minimum viable service?

class HailoMemcacheService
{
    private $mc = null;

    public function __call($name, $arguments)
    {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance()
    {
        // Connect only on first use, not when the service object is built
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            $this->mc->addServers($s); // $s: server list, elided on the slide
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use

Configure clients carefully

$this->mc = new \Memcached;
$this->mc->addServers($s);

$this->mc->setOption(\Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
$this->mc->setOption(\Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(\Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(\Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);

Make sure timeouts are configured

Choose timeouts based on data

[Figure: latency graph with the 95th percentile marked and two candidate timeout positions labelled "here?"]
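One way to base a timeout on data is to take a high percentile of observed latencies and add some headroom. A rough sketch, assuming latency samples in milliseconds have already been collected (for example from logs or Graphite); the function names, the nearest-rank method and the 1.5x headroom factor are illustrative choices, not values from the talk.

```php
<?php
// Nearest-rank percentile of a list of numeric samples.
function percentile(array $samples, float $p): float
{
    if ($samples === [] || $p <= 0 || $p > 100) {
        throw new InvalidArgumentException('Need samples and 0 < p <= 100');
    }
    sort($samples);
    $rank = (int) ceil($p * count($samples) / 100); // 1-based rank
    return (float) $samples[$rank - 1];
}

// Hypothetical helper: timeout = 95th percentile latency plus headroom.
function suggestTimeoutMs(array $latenciesMs, float $headroom = 1.5): int
{
    return (int) ceil(percentile($latenciesMs, 95) * $headroom);
}

// Made-up samples: mostly fast, with a long tail.
$latenciesMs = [3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 9, 9, 10, 11, 12, 14, 15, 18, 40, 250];
echo suggestTimeoutMs($latenciesMs), " ms\n";
```

A timeout set far above the tail makes a hung dependency stall every caller; one set too close to the median turns ordinary slow responses into errors.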

Test

• Kill memcache on box A, measure impact on application

• Kill memcache on box B, measure impact on application

All fine.. we’ve got this covered!

FAIL

• Box A, running in AWS, locks up

• Any parts of application that touch Memcache stop working

Things fail in exotic ways

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT

$ php test-memcache.php

Working OK!

Packets rejected and source notified by ICMP. Expect fast fails.

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP

$ php test-memcache.php

Working OK!

Packets silently dropped. Expect long time outs.

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP

$ php test-memcache.php

Hangs! Uh oh.

• When AWS instances hang they appear to accept connections but drop packets

• Bug!

https://bugs.launchpad.net/libmemcached/+bug/583031

Fix, rinse, repeat

RabbitMQ RabbitMQ RabbitMQ

Service

AMQP (port 5672)

HA cluster

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 5672 \
    -m state --state ESTABLISHED \
    -j DROP

$ php test-rabbitmq.php

Fantastic! Block AMQP port, client times out

FAIL

“RabbitMQ clusters do not tolerate network partitions well.”

http://www.rabbitmq.com/partitions.html

$ epmd -names
epmd: up and running on port 4369 with data:
name rabbit at port 60278

Each node listens on a port assigned by EPMD

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 60278 \
    -m state --state ESTABLISHED \
    -j DROP

$ php test-rabbitmq.php

Hangs! Uh oh.

Mnesia('rabbit@dmzutilities03-global01-test'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@dmzutilities01-global01-test'}

application: rabbitmq_management
exited: shutdown
type: temporary

RabbitMQ logs show partitioned network error; nodes shut down

while ($read < $n && !feof($this->sock->real_sock()) &&
    (false !== ($buf = fread($this->sock->real_sock(), $n - $read)))) {
    $read += strlen($buf);
    $res .= $buf;
}

PHP library didn’t have any time limit on reading a frame
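A read loop like the one above can be bounded by giving the whole frame read a deadline. The sketch below is one way to do that with `stream_select`; it is not the fix that was applied to the library, and `readWithDeadline` is a hypothetical name. It is demonstrated on a local socket pair rather than a real AMQP connection.

```php
<?php
// Read exactly $n bytes from $sock, or throw once $timeoutSeconds has elapsed.
function readWithDeadline($sock, int $n, float $timeoutSeconds): string
{
    $deadline = microtime(true) + $timeoutSeconds;
    $res = '';
    while (strlen($res) < $n) {
        $remaining = $deadline - microtime(true);
        if ($remaining <= 0) {
            throw new RuntimeException('Timed out reading frame');
        }
        $read = [$sock];
        $write = null;
        $except = null;
        $sec = (int) floor($remaining);
        $usec = (int) (($remaining - $sec) * 1000000);
        // Wait for data, but never longer than the time left before the deadline.
        $ready = stream_select($read, $write, $except, $sec, $usec);
        if ($ready === false) {
            throw new RuntimeException('stream_select failed');
        }
        if ($ready === 0) {
            continue; // nothing arrived; the loop re-checks the deadline
        }
        $buf = fread($sock, $n - strlen($res));
        if ($buf === false || ($buf === '' && feof($sock))) {
            throw new RuntimeException('Connection closed while reading frame');
        }
        $res .= $buf;
    }
    return $res;
}

// Demo: the peer sends only 3 of the 10 bytes we ask for.
[$a, $b] = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
fwrite($a, 'abc');
try {
    readWithDeadline($b, 10, 0.2);
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

The deadline covers the whole frame, not each individual `fread`, so a peer that trickles one byte at a time cannot keep the caller blocked indefinitely.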

Fix, rinse, repeat

It would be nice if we could automate this

Automate!

• Hailo run a dedicated automated test environment

• Powered by bash, JMeter and Graphite

• Continuous automated testing with failure simulations

Fix attempt 1: bad timeouts configured

Fix attempt 2: better timeouts

Simulate in system tests

Simulate failure

Assert monitoring endpoint picks this up

Assert features still work
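The three steps above can be sketched in plain PHP. Everything here is hypothetical: the `Cache` interface, `FailingCache`, `FareService` and its `health()` method stand in for whatever the real services and monitoring endpoint look like.

```php
<?php
interface Cache
{
    public function get(string $key): ?string;
}

// Simulated failure: every call behaves like a timed-out cache node.
final class FailingCache implements Cache
{
    public function get(string $key): ?string
    {
        throw new RuntimeException('cache timed out');
    }
}

final class FareService
{
    private Cache $cache;
    private bool $cacheHealthy = true;

    public function __construct(Cache $cache)
    {
        $this->cache = $cache;
    }

    public function estimate(string $route): string
    {
        try {
            $cached = $this->cache->get($route);
            $this->cacheHealthy = true;
            if ($cached !== null) {
                return $cached;
            }
        } catch (RuntimeException $e) {
            // The cache is off the critical path: record the failure, carry on.
            $this->cacheHealthy = false;
        }
        return 'estimate for ' . $route; // placeholder for the slow path
    }

    // What a monitoring endpoint might report.
    public function health(): array
    {
        return ['cache' => $this->cacheHealthy ? 'ok' : 'failing'];
    }
}

$service = new FareService(new FailingCache()); // simulate failure
$result = $service->estimate('A-B');            // exercise the feature
$health = $service->health();
var_dump($health['cache'] === 'failing');       // monitoring picks it up
var_dump($result === 'estimate for A-B');       // feature still works
```

In a real system test the failure would be injected at the network level, as with the iptables rules earlier, so that the client library's own timeout handling is exercised too.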

In conclusion

You should test for failure

How does the software react?
How does the PHP client react?

Automation makes continuous failure testing feasible

Systems that cope well with failure are easier to operate

TIMED BLOCK ALL THE THINGS

Further reading

Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix

Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator

Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/

ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems
