Planning to fail @davegardnerisme #phpne13


Slides from my Planning to Fail talk given at PHP North East conference 2013. This is a slightly longer version of the same talk given at the PHP UK conference. The talk was on how you can build resilient systems by embracing failure.

Page 1: Planning to Fail #phpne13

Planning to fail

@davegardnerisme #phpne13

Page 2: Planning to Fail #phpne13

dave

Page 3: Planning to Fail #phpne13

the taxi app

Page 4: Planning to Fail #phpne13

Planning to fail

Page 5: Planning to Fail #phpne13

Planning for failure

Page 6: Planning to Fail #phpne13

Planning to fail

Page 8: Planning to Fail #phpne13

99.9% (three nines)

Downtime:

43.8 minutes per month
8.76 hours per year

Page 9: Planning to Fail #phpne13

99.99% (four nines)

Downtime:

4.32 minutes per month
52.56 minutes per year

Page 10: Planning to Fail #phpne13

99.999% (five nines)

Downtime:

25.9 seconds per month
5.26 minutes per year
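The downtime budgets on these slides follow from simple arithmetic: the allowed downtime is (1 - availability) multiplied by the length of the period. The sketch below is illustrative only; the function name and constants are not from the talk, and it assumes a 365-day year and an average month of 365/12 days.

```php
<?php
// Downtime budget for an availability target.
// Assumptions: 365-day year, average month of 365/12 days.

const SECONDS_PER_YEAR = 365 * 24 * 3600;
const SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12;

// Seconds of downtime allowed in a period at the given availability (in percent).
function downtimeSeconds(float $availabilityPercent, float $periodSeconds): float
{
    return (1 - $availabilityPercent / 100) * $periodSeconds;
}

foreach ([99.9, 99.99, 99.999] as $target) {
    printf(
        "%s%%: %.2f min/month, %.2f min/year\n",
        $target,
        downtimeSeconds($target, SECONDS_PER_MONTH) / 60,
        downtimeSeconds($target, SECONDS_PER_YEAR) / 60
    );
}
```

The monthly figure depends on the month length assumed. With a 30-day month the budgets come out at 43.2 minutes, 4.32 minutes and 25.9 seconds; with an average month they are 43.8 minutes, 4.38 minutes and 26.3 seconds.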

Page 11: Planning to Fail #phpne13

www.whoownsmyavailability.com

?

Page 12: Planning to Fail #phpne13

www.whoownsmyavailability.com

YOU

Page 13: Planning to Fail #phpne13

The beginning

Page 14: Planning to Fail #phpne13

<?php

Page 15: Planning to Fail #phpne13

My website: single VPS running PHP + MySQL

Page 16: Planning to Fail #phpne13

No growth, low volume, simple functionality, one engineer (me!)

Page 17: Planning to Fail #phpne13

Large growth, high volume, complex functionality, lots of engineers

Page 18: Planning to Fail #phpne13

• Launched in London, November 2011

• Now in 5 cities in 3 countries (30%+ growth every month)

• A Hailo hail is accepted around the world every 5 seconds

Page 19: Planning to Fail #phpne13

“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”

http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf

Page 20: Planning to Fail #phpne13

• SOA (10+ services)

• AWS (3 regions, 9 AZs, lots of instances)

• 10+ engineers building services

and you? (Hailo is hiring)

Page 21: Planning to Fail #phpne13

Our overall reliability is in

danger

Page 22: Planning to Fail #phpne13

Embracing failure

(a coping strategy)

Page 23: Planning to Fail #phpne13
Page 24: Planning to Fail #phpne13

VPC (running PHP+MySQL)

reliable?

Page 25: Planning to Fail #phpne13

Reliable !== Resilient

Page 26: Planning to Fail #phpne13

Choosing a stack

Page 27: Planning to Fail #phpne13

“Hailo” (running PHP+MySQL)

reliable?

Page 28: Planning to Fail #phpne13

Service

each service does one job well

Service Service Service

Service Oriented Architecture

Page 29: Planning to Fail #phpne13

• Fewer lines of code

• Fewer responsibilities

• Changes less frequently

• Can swap entire implementation if needed

Page 30: Planning to Fail #phpne13

Service (running PHP+MySQL)

reliable?

Page 31: Planning to Fail #phpne13

Service MySQL

MySQL running on different box

Page 32: Planning to Fail #phpne13

Service

MySQL

MySQL

MySQL running in Multi-Master mode

Page 33: Planning to Fail #phpne13

Going global

Page 34: Planning to Fail #phpne13

MySQL

Separating concerns

• CRUD

• Locking

• Search

• Analytics

• ID generation

also queuing…

Page 35: Planning to Fail #phpne13

At Hailo we look for technologies that are:

• Distributed: run on more than one machine

• Homogeneous: all nodes look the same

• Resilient: can cope with the loss of node(s) with no loss of data

Page 36: Planning to Fail #phpne13

“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”

http://blog.b3k.us/2012/01/24/some-rules.html

Page 37: Planning to Fail #phpne13

• Highly performant, scalable and resilient data store

• Underpins much of what we do at Hailo

• Makes multi-DC easy!

Page 38: Planning to Fail #phpne13

• Highly reliable distributed coordination

• We implement locking and leadership election on top of ZK and use sparingly

ZooKeeper

Page 39: Planning to Fail #phpne13

• Distributed, RESTful, Search Engine built on top of Apache Lucene

• Replaced basic foo LIKE ‘%bar%’ queries (so much better)

Page 40: Planning to Fail #phpne13

• Realtime message processing system designed to handle billions of messages per day

• Fault tolerant, highly available with reliable message delivery guarantee

NSQ

Page 41: Planning to Fail #phpne13

• Real time incremental analytics platform, backed by Apache Cassandra

• Powerful SQL-like interface

• Scalable and highly available

Page 42: Planning to Fail #phpne13

• Distributed ID generation with no coordination required

• Rock solid

Cruftflake
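Coordination-free ID generation usually works by packing a timestamp, a per-machine identifier and a per-millisecond sequence number into one integer, so two machines can never produce the same value. The following is a minimal Snowflake-style sketch of that idea. The class name, the bit widths (41/10/12) and the epoch are assumptions chosen for illustration; it is not Cruftflake's actual code or layout.

```php
<?php
// Snowflake-style ID sketch: [ms timestamp][10-bit machine ID][12-bit sequence].
// Bit widths and epoch are illustrative assumptions. Requires 64-bit PHP integers.

final class IdGenerator
{
    const EPOCH_MS = 1325376000000; // 2012-01-01T00:00:00Z, an arbitrary custom epoch

    private $machineId;
    private $lastMs = -1;
    private $sequence = 0;

    public function __construct(int $machineId)
    {
        if ($machineId < 0 || $machineId > 1023) {
            throw new InvalidArgumentException('machine ID must fit in 10 bits');
        }
        $this->machineId = $machineId;
    }

    // $nowMs can be injected for testing; defaults to the wall clock.
    public function generate(?int $nowMs = null): int
    {
        $nowMs = $nowMs ?? (int) floor(microtime(true) * 1000);
        if ($nowMs < $this->lastMs) {
            throw new RuntimeException('clock moved backwards; refusing to generate');
        }
        if ($nowMs === $this->lastMs) {
            $this->sequence++;
            if ($this->sequence > 4095) {
                // A fuller implementation would wait for the next millisecond.
                throw new OverflowException('sequence exhausted for this millisecond');
            }
        } else {
            $this->sequence = 0;
            $this->lastMs = $nowMs;
        }
        return (($nowMs - self::EPOCH_MS) << 22) | ($this->machineId << 12) | $this->sequence;
    }
}
```

Each machine only needs a unique machine ID at start-up; after that no node talks to any other to generate an ID, and IDs from one generator sort by time.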

Page 43: Planning to Fail #phpne13

• All these technologies have similar properties of distribution and resilience

• They are designed to cope with failure

• They are not broken by design

Page 44: Planning to Fail #phpne13

Lessons learned

Page 45: Planning to Fail #phpne13

Minimise the critical path

Page 46: Planning to Fail #phpne13

What is the minimum viable service?

Page 47: Planning to Fail #phpne13

class HailoMemcacheService
{
    private $mc = null;

    public function __call($name, $args)
    {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance()
    {
        // Nothing connects until the first call actually needs Memcache
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            $this->mc->addServers($s); // $s: server list, elided on the slide
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use

Page 48: Planning to Fail #phpne13

Configure clients carefully

Page 49: Planning to Fail #phpne13

$this->mc = new \Memcached;
$this->mc->addServers($s);

$this->mc->setOption(
    \Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
$this->mc->setOption(
    \Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);

Make sure timeouts are configured

Page 50: Planning to Fail #phpne13

Choose timeouts based on data

here?

Page 52: Planning to Fail #phpne13

95th percentile

here?
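One way to turn measured latencies into a timeout is to compute a high percentile and add some headroom. The sketch below uses the nearest-rank method; the sample data, the function name and the 1.5x headroom factor are all assumptions for illustration, not figures from the talk.

```php
<?php
// Nearest-rank percentile over latency samples, used to pick a timeout.

function percentile(array $samples, float $p): float
{
    if (count($samples) === 0) {
        throw new InvalidArgumentException('no samples');
    }
    sort($samples); // sorts the local copy only
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[max($rank, 1) - 1];
}

// Hypothetical request latencies in milliseconds, including one outlier.
$latenciesMs = [12, 14, 11, 13, 15, 12, 16, 14, 13, 12,
                18, 13, 14, 15, 12, 13, 17, 14, 250, 13];

$p95 = percentile($latenciesMs, 95);

// Headroom factor of 1.5 is an assumption; tune it against real data.
$timeoutMs = (int) ceil($p95 * 1.5);

printf("p95 = %.0f ms, timeout = %d ms\n", $p95, $timeoutMs);
```

Using a percentile rather than the maximum keeps a single slow outlier (250 ms here) from inflating the timeout.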

Page 53: Planning to Fail #phpne13

Test

Page 54: Planning to Fail #phpne13

• Kill memcache on box A, measure impact on application

• Kill memcache on box B, measure impact on application

All fine.. we’ve got this covered!

Page 55: Planning to Fail #phpne13

FAIL

Page 56: Planning to Fail #phpne13

• Box A, running in AWS, locks up

• Any parts of application that touch Memcache stop working

Page 57: Planning to Fail #phpne13

Things fail in exotic ways

Page 58: Planning to Fail #phpne13

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT

$ php test-memcache.php

Working OK!

Packets rejected and source notified by ICMP. Expect fast fails.

Page 59: Planning to Fail #phpne13

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP

$ php test-memcache.php

Working OK!

Packets silently dropped. Expect long time outs.

Page 60: Planning to Fail #phpne13

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP

$ php test-memcache.php

Hangs! Uh oh.

Page 61: Planning to Fail #phpne13

• When AWS instances hang they appear to accept connections but drop packets

• Bug!

https://bugs.launchpad.net/libmemcached/+bug/583031

Page 62: Planning to Fail #phpne13

Fix, rinse, repeat

Page 63: Planning to Fail #phpne13

RabbitMQ RabbitMQ RabbitMQ

Service

AMQP (port 5672)

HA cluster

Page 64: Planning to Fail #phpne13

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 5672 \
    -m state --state ESTABLISHED \
    -j DROP

$ php test-rabbitmq.php

Fantastic! Block AMQP port, client times out

Page 65: Planning to Fail #phpne13

FAIL

Page 66: Planning to Fail #phpne13

“RabbitMQ clusters do not tolerate network partitions well.”

http://www.rabbitmq.com/partitions.html

Page 67: Planning to Fail #phpne13

$ epmd -names
epmd: up and running on port 4369 with data:
name rabbit at port 60278

Each node listens on a port assigned by EPMD

Page 68: Planning to Fail #phpne13
Page 69: Planning to Fail #phpne13

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 60278 \
    -m state --state ESTABLISHED \
    -j DROP

$ php test-rabbitmq.php

Hangs! Uh oh.

Page 70: Planning to Fail #phpne13

Mnesia('rabbit@dmzutilities03-global01-test'): ** ERROR ** mnesia_event got
    {inconsistent_database, running_partitioned_network,
     'rabbit@dmzutilities01-global01-test'}

application: rabbitmq_management
exited: shutdown
type: temporary

RabbitMQ logs show partitioned network error; nodes shut down

Page 71: Planning to Fail #phpne13
Page 72: Planning to Fail #phpne13

while ($read < $n
    && !feof($this->sock->real_sock())
    && (false !== ($buf = fread($this->sock->real_sock(), $n - $read)))
) {
    $read += strlen($buf);
    $res .= $buf;
}

PHP library didn’t have any time limit on reading a frame
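A read loop like the one above can be bounded by tracking an overall deadline and waiting on the socket with stream_select() instead of blocking in fread(). This is a sketch of that approach, not the fix that went into the AMQP library; readExactly() and its parameters are hypothetical names.

```php
<?php
// Read exactly $n bytes from a stream, giving up once a total deadline passes.

function readExactly($stream, int $n, float $timeoutSeconds): string
{
    $deadline = microtime(true) + $timeoutSeconds;
    $data = '';

    while (strlen($data) < $n) {
        $remaining = $deadline - microtime(true);
        if ($remaining <= 0) {
            throw new RuntimeException(
                'timed out after reading ' . strlen($data) . " of $n bytes"
            );
        }

        // Wait until the stream is readable, but never past the deadline.
        $read = [$stream];
        $write = null;
        $except = null;
        $sec = (int) floor($remaining);
        $usec = (int) (($remaining - $sec) * 1000000);
        $ready = stream_select($read, $write, $except, $sec, $usec);

        if ($ready === false) {
            throw new RuntimeException('stream_select failed');
        }
        if ($ready === 0) {
            continue; // nothing arrived; the loop re-checks the deadline
        }

        $chunk = fread($stream, $n - strlen($data));
        if ($chunk === false || ($chunk === '' && feof($stream))) {
            throw new RuntimeException('connection closed before the frame was complete');
        }
        $data .= $chunk;
    }

    return $data;
}
```

The deadline covers the whole frame rather than each individual read, so a peer that trickles one byte at a time cannot keep the caller waiting indefinitely.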

Page 73: Planning to Fail #phpne13

Fix, rinse, repeat

Page 74: Planning to Fail #phpne13

It would be nice if we could automate this

Page 75: Planning to Fail #phpne13

Automate!

Page 76: Planning to Fail #phpne13

• Hailo run a dedicated automated test environment

• Powered by bash, JMeter and Graphite

• Continuous automated testing with failure simulations

Page 77: Planning to Fail #phpne13

Fix attempt 1: bad timeouts configured

Page 78: Planning to Fail #phpne13

Fix attempt 2: better timeouts

Page 79: Planning to Fail #phpne13

Simulate in system tests

Page 80: Planning to Fail #phpne13

Simulate failure

Assert monitoring endpoint picks this up

Assert features still work
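The three steps above can be sketched as a small test: swap in a dependency that fails, then check both the health report and the feature. Every class and method name below is hypothetical, and the database lookup is a placeholder; a real system test would break the dependency at the network level, as in the iptables examples.

```php
<?php
// Failure-simulation test sketch. All names here are hypothetical.

interface Cache
{
    public function get(string $key): ?string;
}

// Stands in for a cache whose box has locked up.
final class FailingCache implements Cache
{
    public function get(string $key): ?string
    {
        throw new RuntimeException('cache unavailable');
    }
}

final class ProfileService
{
    private $cache;
    private $cacheHealthy = true;

    public function __construct(Cache $cache)
    {
        $this->cache = $cache;
    }

    // The feature must keep working without the cache.
    public function displayName(string $userId): string
    {
        try {
            $cached = $this->cache->get("name:$userId");
            $this->cacheHealthy = true;
            if ($cached !== null) {
                return $cached;
            }
        } catch (RuntimeException $e) {
            $this->cacheHealthy = false; // recorded for the monitoring endpoint
        }
        return $this->loadFromDatabase($userId);
    }

    // What a monitoring endpoint would expose.
    public function health(): array
    {
        return ['cache' => $this->cacheHealthy ? 'ok' : 'failing'];
    }

    private function loadFromDatabase(string $userId): string
    {
        return "user-$userId"; // placeholder for the real lookup
    }
}

// Simulate the failure, then inspect monitoring and the feature.
$service = new ProfileService(new FailingCache());
$name = $service->displayName('42');
$health = $service->health();
```

A test would then assert that $health reports the cache as failing and that $name still holds a usable value.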

Page 81: Planning to Fail #phpne13

In conclusion

Page 83: Planning to Fail #phpne13

You should test for failure

How does the software react?
How does the PHP client react?

Page 84: Planning to Fail #phpne13

Automation makes continuous failure testing feasible

Page 85: Planning to Fail #phpne13

Systems that cope well with failure are easier to operate

Page 86: Planning to Fail #phpne13

TIMED BLOCK ALL THE THINGS

Page 88: Planning to Fail #phpne13

Further reading

Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix

Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator

Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/

ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems