Planning to fail @davegardnerisme #phpuk2013

How to build resilient and reliable services by embracing failure.

Page 1: Planning to Fail #phpuk13

Planning to fail

@davegardnerisme #phpuk2013

Page 2: Planning to Fail #phpuk13

dave

Page 3: Planning to Fail #phpuk13

the taxi app

Page 4: Planning to Fail #phpuk13

Planning to fail

Page 5: Planning to Fail #phpuk13

Planning for failure

Page 6: Planning to Fail #phpuk13

Planning to fail

Page 7: Planning to Fail #phpuk13

The beginning

Page 8: Planning to Fail #phpuk13

<?php

Page 9: Planning to Fail #phpuk13

My website: single VPS running PHP + MySQL

Page 10: Planning to Fail #phpuk13

No growth, low volume, simple functionality, one engineer (me!)

Page 11: Planning to Fail #phpuk13

Large growth, high volume, complex functionality, lots of engineers

Page 12: Planning to Fail #phpuk13

• Launched in London, November 2011

• Now in 5 cities in 3 countries (30%+ growth every month)

• A Hailo hail is accepted around the world every 5 seconds

Page 13: Planning to Fail #phpuk13

“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”

http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf

Page 14: Planning to Fail #phpuk13

• SOA (10+ services)

• AWS (3 regions, 9 AZs, lots of instances)

• 10+ engineers building services

and you? (Hailo is hiring)

Page 15: Planning to Fail #phpuk13

Our overall reliability is in

danger

Page 16: Planning to Fail #phpuk13

Embracing failure

(a coping strategy)

Page 17: Planning to Fail #phpuk13
Page 18: Planning to Fail #phpuk13

VPS (running PHP+MySQL)

reliable?

Page 19: Planning to Fail #phpuk13

Reliable !== Resilient

Page 20: Planning to Fail #phpuk13
Page 21: Planning to Fail #phpuk13

Choosing a stack

Page 22: Planning to Fail #phpuk13

“Hailo” (running PHP+MySQL)

reliable?

Page 23: Planning to Fail #phpuk13

Service

each service does one job well

Service Service Service

Service Oriented Architecture

Page 24: Planning to Fail #phpuk13

• Fewer lines of code

• Fewer responsibilities

• Changes less frequently

• Can swap entire implementation if needed

Page 25: Planning to Fail #phpuk13

Service (running PHP+MySQL)

reliable?

Page 26: Planning to Fail #phpuk13

Service MySQL

MySQL running on different box

Page 27: Planning to Fail #phpuk13

Service

MySQL

MySQL

MySQL running in Multi-Master mode

Page 28: Planning to Fail #phpuk13

Going global

Page 29: Planning to Fail #phpuk13

MySQL

Separating concerns

• CRUD

• Locking

• Search

• Analytics

• ID generation

also queuing…

Page 30: Planning to Fail #phpuk13

At Hailo we look for technologies that are:

• Distributed: run on more than one machine

• Homogeneous: all nodes look the same

• Resilient: can cope with the loss of node(s) with no loss of data

Page 31: Planning to Fail #phpuk13

“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”

http://blog.b3k.us/2012/01/24/some-rules.html

Page 32: Planning to Fail #phpuk13

• Highly performant, scalable and resilient data store

• Underpins much of what we do at Hailo

• Makes multi-DC easy!

Page 33: Planning to Fail #phpuk13

• Highly reliable distributed coordination

• We implement locking and leadership election on top of ZK and use sparingly

ZooKeeper

Page 34: Planning to Fail #phpuk13

• Distributed, RESTful, Search Engine built on top of Apache Lucene

• Replaced basic foo LIKE '%bar%' queries (so much better)

Elasticsearch

Page 35: Planning to Fail #phpuk13

• Realtime message processing system designed to handle billions of messages per day

• Fault tolerant, highly available with reliable message delivery guarantee

NSQ

Page 36: Planning to Fail #phpuk13

• Distributed ID generation with no coordination required

• Rock solid

Cruftflake
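The slides do not show how IDs can be generated without coordination, so here is a minimal sketch of the general idea, assuming a Snowflake-style layout: 41 bits of milliseconds since a custom epoch, 10 bits of machine ID and 12 bits of per-millisecond sequence. The bit widths and the function names `composeId` and `decomposeId` are illustrative assumptions, not Cruftflake's actual API. It needs 64-bit PHP integers.

```php
<?php
// Assumed Snowflake-style layout (not taken from the slides):
// [41 bits timestamp][10 bits machine ID][12 bits sequence]
const MACHINE_BITS = 10;
const SEQUENCE_BITS = 12;
const TIMESTAMP_BITS = 41;

// Pack the three parts into one 64-bit integer. Each machine only needs
// its own clock, its own ID and a local counter, so no node has to talk
// to another node when generating an ID.
function composeId(int $millisSinceEpoch, int $machineId, int $sequence): int
{
    if ($millisSinceEpoch < 0 || $millisSinceEpoch >= (1 << TIMESTAMP_BITS)) {
        throw new InvalidArgumentException('timestamp out of range');
    }
    if ($machineId < 0 || $machineId >= (1 << MACHINE_BITS)) {
        throw new InvalidArgumentException('machine ID out of range');
    }
    if ($sequence < 0 || $sequence >= (1 << SEQUENCE_BITS)) {
        throw new InvalidArgumentException('sequence out of range');
    }

    return ($millisSinceEpoch << (MACHINE_BITS + SEQUENCE_BITS))
        | ($machineId << SEQUENCE_BITS)
        | $sequence;
}

// Split an ID back into its parts (useful for debugging).
function decomposeId(int $id): array
{
    return [
        'millis'   => $id >> (MACHINE_BITS + SEQUENCE_BITS),
        'machine'  => ($id >> SEQUENCE_BITS) & ((1 << MACHINE_BITS) - 1),
        'sequence' => $id & ((1 << SEQUENCE_BITS) - 1),
    ];
}
```

Because the timestamp occupies the high bits, IDs from any machine sort roughly by creation time, and two machines can never collide as long as their machine IDs differ.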

Page 37: Planning to Fail #phpuk13

• All these technologies have similar properties of distribution and resilience

• They are designed to cope with failure

• They are not broken by design

Page 38: Planning to Fail #phpuk13

Lessons learned

Page 39: Planning to Fail #phpuk13

Minimise the critical path

Page 40: Planning to Fail #phpuk13

What is the minimum viable service?

Page 41: Planning to Fail #phpuk13

class HailoMemcacheService
{
    private $mc = null;

    public function __call($name, $args)
    {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance()
    {
        // connect lazily, on first use
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            // $s: the configured server list (not shown on the slide)
            $this->mc->addServers($s);
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use
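The lazy-init pattern on this slide can be sketched in a form that runs without the Memcached extension. `LazyClient` and its `$factory` argument are hypothetical names introduced for illustration. The point is only that constructing the wrapper does no work, and the real client is built on the first method call.

```php
<?php
// Hypothetical generic wrapper: $factory builds (and connects) the real
// client. Nothing is built until a method is actually called.
final class LazyClient
{
    /** @var callable */
    private $factory;

    /** @var object|null */
    private $instance = null;

    public function __construct(callable $factory)
    {
        // Only remember how to build the client; do not connect here.
        $this->factory = $factory;
    }

    // Forward any method call to the real client, creating it on first use.
    public function __call($name, $args)
    {
        return call_user_func_array([$this->getInstance(), $name], $args);
    }

    private function getInstance()
    {
        if ($this->instance === null) {
            $this->instance = call_user_func($this->factory);
        }
        return $this->instance;
    }
}
```

A request that never touches the cache then never pays for, or fails on, a cache connection.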

Page 42: Planning to Fail #phpuk13

Configure clients carefully

Page 43: Planning to Fail #phpuk13

$this->mc = new \Memcached;
$this->mc->addServers($s);

$this->mc->setOption(
    \Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
$this->mc->setOption(
    \Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);

Make sure timeouts are configured

Page 44: Planning to Fail #phpuk13

Choose timeouts based on data

here?

Page 46: Planning to Fail #phpuk13

95th percentile

here?
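One way to turn measured latencies into a timeout is sketched below, assuming the samples are already collected in milliseconds. The nearest-rank percentile method, the function names and the 1.5 safety margin are assumptions for illustration. The slides only suggest looking at where the 95th percentile falls.

```php
<?php
// Nearest-rank percentile of observed latencies (in milliseconds).
function percentile(array $samples, float $p): float
{
    $n = count($samples);
    if ($n === 0) {
        throw new InvalidArgumentException('need at least one sample');
    }
    if ($p <= 0 || $p > 100) {
        throw new InvalidArgumentException('percentile must be in (0, 100]');
    }

    sort($samples); // sorts a local copy; the caller's array is untouched
    $rank = (int) ceil($p * $n / 100); // 1-based nearest rank

    return (float) $samples[$rank - 1];
}

// Hypothetical policy: timeout = chosen percentile times a safety margin.
function suggestTimeoutMs(array $latenciesMs, float $p = 95.0, float $margin = 1.5): int
{
    return (int) ceil(percentile($latenciesMs, $p) * $margin);
}
```

Choosing a percentile rather than the maximum keeps one pathological sample from inflating the timeout. The margin is a judgment call that should be revisited as traffic changes.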

Page 47: Planning to Fail #phpuk13

Test

Page 48: Planning to Fail #phpuk13

• Kill memcache on box A, measure impact on application

• Kill memcache on box B, measure impact on application

All fine.. we’ve got this covered!

Page 49: Planning to Fail #phpuk13

FAIL

Page 50: Planning to Fail #phpuk13

• Box A, running in AWS, locks up

• Any parts of application that touch Memcache stop working

Page 51: Planning to Fail #phpuk13

Things fail in exotic ways

Page 52: Planning to Fail #phpuk13

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT

$ php test-memcache.php

Working OK!

Packets rejected and source notified by ICMP. Expect fast fails.

Page 53: Planning to Fail #phpuk13

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP

$ php test-memcache.php

Working OK!

Packets silently dropped. Expect long timeouts.

Page 54: Planning to Fail #phpuk13

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP

$ php test-memcache.php

Hangs! Uh oh.

Page 55: Planning to Fail #phpuk13

• When AWS instances hang they appear to accept connections but drop packets

• Bug!

https://bugs.launchpad.net/libmemcached/+bug/583031

Page 56: Planning to Fail #phpuk13

Fix, rinse, repeat

Page 57: Planning to Fail #phpuk13

It would be nice if we could automate this

Page 58: Planning to Fail #phpuk13

Automate!

Page 59: Planning to Fail #phpuk13

• Hailo run a dedicated automated test environment

• Powered by bash, JMeter and Graphite

• Continuous automated testing with failure simulations

Page 60: Planning to Fail #phpuk13

Fix attempt 1: bad timeouts configured

Page 61: Planning to Fail #phpuk13

Fix attempt 2: better timeouts

Page 62: Planning to Fail #phpuk13

Simulate in system tests

Page 63: Planning to Fail #phpuk13

Simulate failure

Assert monitoring endpoint picks this up

Assert features still work
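The three steps above can be sketched as a small self-contained check. Everything here is a hypothetical stand-in: `CacheClient`, `FailingCache` and `FareService` are illustrative names, the `health()` method plays the role of a monitoring endpoint, and the fare calculation is a placeholder. A real system test would inject the failure at the network level instead.

```php
<?php
// Hypothetical cache abstraction so a failing implementation can be swapped in.
interface CacheClient
{
    public function get(string $key);
}

// Simulated failure: every cache call throws.
final class FailingCache implements CacheClient
{
    public function get(string $key)
    {
        throw new RuntimeException('simulated cache outage');
    }
}

final class FareService
{
    private $cache;
    private $cacheHealthy = true;

    public function __construct(CacheClient $cache)
    {
        $this->cache = $cache;
    }

    // The cache is kept off the critical path: on failure we record the
    // problem and fall back to computing the answer.
    public function quote(string $route): int
    {
        try {
            $cached = $this->cache->get($route);
            $this->cacheHealthy = true;
            if ($cached !== null) {
                return $cached;
            }
        } catch (RuntimeException $e) {
            $this->cacheHealthy = false;
        }
        return $this->computeQuote($route);
    }

    // What a monitoring endpoint might report.
    public function health(): array
    {
        return ['cache' => $this->cacheHealthy ? 'ok' : 'degraded'];
    }

    private function computeQuote(string $route): int
    {
        return 500 + 10 * strlen($route); // placeholder calculation
    }
}
```

A test then builds `FareService` with `FailingCache`, asserts that `quote()` still returns a value, and asserts that `health()` reports the cache as degraded.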

Page 64: Planning to Fail #phpuk13

In conclusion

Page 66: Planning to Fail #phpuk13

TIMED BLOCK ALL THE THINGS
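One reading of this slogan is to wrap every external call in a block that records how long it took, so that timeouts can be chosen from data. The sketch below assumes that reading. `timedBlock`, its `$report` callback and the injectable `$clock` are hypothetical names. A real setup might forward the measurement to Graphite, which the slides mention.

```php
<?php
// Hypothetical helper: runs $fn, then passes the block name and the elapsed
// time in milliseconds to $report. It measures but does not interrupt:
// a call that never returns still hangs, so client-level timeouts remain
// necessary.
function timedBlock(string $name, callable $fn, callable $report, ?callable $clock = null)
{
    // Default clock: wall time in seconds. Injectable so tests are deterministic.
    $clock = $clock ?? function () {
        return microtime(true);
    };

    $start = $clock();
    try {
        return $fn();
    } finally {
        // Runs on success and on exception, so failures are timed too.
        $report($name, ($clock() - $start) * 1000.0);
    }
}
```

Reporting from the `finally` branch means slow failures show up in the same graphs as slow successes, which is usually where timeout problems first become visible.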

Page 68: Planning to Fail #phpuk13

Further reading

Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix

Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator

Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/

ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems