The Glue is the Hard Part: Making a Production-Ready PaaS Evan Krall Site Reliability Engineer @ Yelp


Agenda

● Intro

○ Context: Yelp before PaaSTA

○ What's in a PaaS?

● Production-Ready

○ What makes a PaaS production-ready?

● PaaSTA

○ What parts does PaaSTA have?

○ How did we glue them together?

● Wrap-up

○ Lessons learned

○ Next steps


Intro

Yelp’s Mission: Connecting people with great local businesses.

Yelp Stats: As of Q3 2015

89M, 32, 71%, 90M

Context: Yelp before PaaSTA

Service Oriented Architecture

Scale our engineering team by splitting our codebase into many smaller parts

Dependency Hell

As services gain adoption, shared libraries become difficult to upgrade. Not all services are Python anymore.

Too Many Services

We can no longer fit all services on each service host. How do we split them up?

“I wonder how many organizations that say they're "doing DevOps" are actually building a bespoke PaaS. And how many of those realize it.”

— @markimbriaco


Basic PaaS Components

Scheduling

Decide which hosts run a service

Delivery

Put the code on the host and run it

Discovery

Tell clients where your service is running

Production-Ready

What makes a PaaS trustworthy enough to run our website?

Production-ready systems minimize impact of failures

impact = frequency × severity × duration
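As a rough illustration of the formula above, the sketch below compares two failure modes. The function name, the units, and every number are made up for illustration; none of them come from the talk.

```python
def failure_impact(frequency_per_year: float, severity: float, duration_minutes: float) -> float:
    """Impact as the product of frequency, severity, and duration."""
    return frequency_per_year * severity * duration_minutes


# Hypothetical numbers: severity is the fraction of requests affected.
rare_full_outage = failure_impact(frequency_per_year=2, severity=1.0, duration_minutes=60)
frequent_partial_blip = failure_impact(frequency_per_year=50, severity=0.05, duration_minutes=5)

# Because the factors multiply, halving any one of them halves the impact.
halved_duration = failure_impact(frequency_per_year=2, severity=1.0, duration_minutes=30)
```

Under these assumed numbers the rare full outage carries more impact than the frequent partial one, which is one way to decide where to spend effort first.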

A production-ready PaaS should minimize the impact of both application failures and PaaS failures

Reduce failure frequency

Use stable components (software, hardware)

You will always have failures.

Reduce failure severity

No SPOFs

Keep working when a box dies

Graceful Degradation

Avoid full outages when components break

Painless upgrades

Upgrades should be easy, without downtime

Reduce failure duration

Self-healing

Recover from common failures automatically

Alerting

Tell humans when things are still broken

Visibility

Make it easy for humans to diagnose issues

PaaSTA

Yelp's Open-Source Docker-based PaaS

PaaSTA

● Delivery: Docker

● Scheduling: Mesos + Marathon

● Discovery: SmartStack

● Alerting: Sensu

Delivery in PaaSTA: Docker

● Self-contained artifacts

● Provides software flexibility

● Reproducible builds

● Resource limits make scheduling easier

Scheduling in PaaSTA: Mesos and Marathon

● Mesos is an "SDK for distributed systems", batteries not included.

● Requires a framework

○ Marathon (like ASG for Mesos)

○ Chronos (Periodic tasks)

● Supports Docker as task executor

Marathon

● Run N copies of Docker image

● Works with Mesos to find space on cluster

● Replaces dead instances

[Figure: Mesos architecture diagram, shown in several build steps and annotated with "(Marathon)" and "(Docker)". From http://mesos.apache.org/documentation/latest/mesos-architecture/]

How do we build & distribute Docker images?

Building Docker images

● Jenkins builds and tests images

● Bless images by creating git tags

○ 1:1 git commit <-> docker image

● Pushes to registry
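One way to keep the 1:1 mapping between git commits and Docker images is to derive the image name from the commit SHA. The sketch below is only an illustration: the naming scheme, the function name, and the registry host are hypothetical, not PaaSTA's actual convention.

```python
import re


def image_name_for_commit(registry: str, service: str, git_sha: str) -> str:
    """Derive a Docker image name from a git commit so the mapping stays 1:1.

    The "<registry>/<service>:<sha>" scheme is an assumption made for this
    example, not the convention used by PaaSTA.
    """
    # Require the full SHA so two different commits can never share a tag.
    if not re.fullmatch(r"[0-9a-f]{40}", git_sha):
        raise ValueError(f"expected a full 40-character git SHA, got {git_sha!r}")
    return f"{registry}/{service}:{git_sha}"
```

With a scheme like this, anyone holding an image name can find the exact commit it was built from, and the reverse.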

Shipping Docker images

● Distribution via private registry

● S3 bucket shared among all environments

[Figure: diagram with the labels "code" and "metadata" and the environments "build", "stage", and "prod"]

How do we configure Marathon?

Aside: Declarative Control

● Describe end goal, not path

● Helps achieve fault tolerance

"Deploy 12abcd34 to prod" vs. "Commit 12abcd34 should be running in prod"

Gas pedal vs. Cruise Control
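The difference between the two phrasings can be sketched as a reconciliation step: compare the desired state with what is observed, and emit only the actions needed to close the gap. This is a minimal illustration with hypothetical names and data shapes, not PaaSTA's implementation.

```python
from typing import Dict, List, Tuple


def reconcile(desired: Dict[str, str], running: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return the actions needed to make `running` match `desired`.

    Both arguments map a service name to a commit: the one that should be
    running and the one that is running (hypothetical shapes for this sketch).
    """
    actions = []
    # Deploy anything that is missing or running the wrong commit.
    for service, commit in sorted(desired.items()):
        if running.get(service) != commit:
            actions.append(("deploy", service, commit))
    # Stop anything that is running but no longer desired.
    for service in sorted(running.keys() - desired.keys()):
        actions.append(("stop", service, running[service]))
    return actions
```

Because the function only looks at state, it is safe to run again after a crash or a partial deploy: once the cluster matches the desired state, it returns no actions.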

Configuring Marathon

● Need a wall around Marathon: it has root on your entire cluster.

● Cron job

● Combines per-service config and currently-blessed docker image

marathon-$cluster.yaml

● # tasks

● CPU, memory

● How to healthcheck your service

● Bounce strategy

● Command / args
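Combining per-service config with the currently blessed image might look roughly like the sketch below. The config keys (`instances`, `cpus`, `mem`, `healthcheck_uri`, `cmd`) are hypothetical stand-ins for the fields listed above. The output keys follow Marathon's app definition format as best I understand it, so check them against the Marathon version you run. Bounce strategy is left out because it is not part of a Marathon app definition.

```python
from typing import Any, Dict


def build_marathon_app(service: str, config: Dict[str, Any], docker_image: str) -> Dict[str, Any]:
    """Merge per-service settings with the currently blessed Docker image."""
    app: Dict[str, Any] = {
        "id": f"/{service}",
        "instances": config["instances"],  # number of tasks
        "cpus": config["cpus"],
        "mem": config["mem"],
        "container": {"type": "DOCKER", "docker": {"image": docker_image}},
        "healthChecks": [{"protocol": "HTTP", "path": config["healthcheck_uri"]}],
    }
    # Only override the image's default command when the service asks for it.
    if "cmd" in config:
        app["cmd"] = config["cmd"]
    return app
```

A cron job could call a function like this for every service and submit the result to Marathon, so that humans never talk to Marathon directly.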

Demo: Deploys

How do services talk to each other?

Discovery in PaaSTA: SmartStack

● Registration agent on each box writes to ZooKeeper

● Discovery agent on each box reads from ZK, configures HAProxy

Registration

Registering with SmartStack

● configure_nerve.py queries local mesos-slave API

● Keeping it local means registration works even if Mesos master or Marathon is down.

● We can register non-PaaSTA services as well

Architecture: Registration

[Figure: registration architecture on a service host. Components shown: service_1, service_2, service_3, hacheck, nerve, configure_nerve.py, ZK. Arrow labels: metadata, healthcheck.]

ZooKeeper Data

Nerve registers service instance in ZooKeeper:

/nerve/region:myregion
├── service_1
│   └── server_1_0000013614
├── service_2
│   └── server_1_0000000959
├── service_3
│   ├── server_1_0000002468
│   └── server_2_0000002467
[...]

{
  "host": "10.0.0.123",
  "port": 31337,
  "name": "server_1",
  "weight": 10
}

hacheck

Normally hacheck acts as a transparent proxy for healthchecks:

$ curl -s yocalhost:6666/http/service_1/1234/status
{
  "uptime": 5693819.315988064,
  "pid": 2595160,
  "host": "server_1",
  "version": "b6309e09d71da8f1e28213d251f7c",
}
$

hacheck

Can also force healthchecks to fail before we shut down a service:

$ hadown service_1
$ curl -s yocalhost:6666/http/service_1/1234/status
Service service_1 in down state since 1443217910: krall
$
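The two behaviours shown for hacheck can be modelled with a small toy class: pass the healthcheck through to the service unless someone has marked the service down. This is not hacheck's code. The class name, method names, and the 503 status for a downed service are assumptions made for the sketch.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class HealthcheckProxy:
    """Toy model of a healthcheck proxy that can force a service 'down'."""

    def __init__(self, check_backend: Callable[[str, int], Tuple[int, str]]) -> None:
        # check_backend performs the real healthcheck and returns (status, body).
        self._check_backend = check_backend
        self._down: Dict[str, Tuple[int, str]] = {}

    def mark_down(self, service: str, reason: str, now: Optional[int] = None) -> None:
        """Force healthchecks for `service` to fail from now on."""
        since = int(time.time()) if now is None else now
        self._down[service] = (since, reason)

    def mark_up(self, service: str) -> None:
        """Resume passing healthchecks through to the service."""
        self._down.pop(service, None)

    def status(self, service: str, port: int) -> Tuple[int, str]:
        if service in self._down:
            since, reason = self._down[service]
            # Assumed status code: fail the check without asking the service.
            return 503, f"Service {service} in down state since {since}: {reason}"
        return self._check_backend(service, port)
```

Failing the healthcheck first lets load balancers drain traffic away before the service process is actually stopped.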

Discovery

Architecture: Discovery

[Figure: discovery architecture. Components shown: client, haproxy, synapse, configure_synapse.py, nerve, ZK. Arrow labels: metadata, traffic.]
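To make the discovery side concrete, the sketch below turns registration entries like the one on the ZooKeeper Data slide into HAProxy `server` lines. It is a simplified illustration of what a discovery agent does, not Synapse's implementation. The function name and the server naming scheme are hypothetical, and the input is a plain list of JSON strings rather than a live ZooKeeper connection.

```python
import json
from typing import Iterable, List


def haproxy_server_lines(znode_payloads: Iterable[str]) -> List[str]:
    """Render one HAProxy `server` line per registered service instance."""
    lines = []
    for payload in znode_payloads:
        entry = json.loads(payload)
        # Hypothetical naming scheme: combine name and port to keep names unique.
        server_name = f"{entry['name']}_{entry['port']}"
        weight = entry.get("weight", 1)
        lines.append(f"server {server_name} {entry['host']}:{entry['port']} weight {weight}")
    # Sort so the generated config is stable between runs.
    return sorted(lines)
```

Sorting the output means the generated config only changes when the set of registrations changes, which avoids reloading HAProxy unnecessarily.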

HAProxy

● By default, bind to 0.0.0.0

● Bind only to yocalhost on public-facing servers

● Gives us goodies for all clients:

○ Redispatch on conn failure

○ Easy request logging

○ Rock-solid load balancing

yocalhost

● One HAProxy per host

● What address to bind HAProxy to?

● 127.0.0.1 is per-container

● Add loopback address to host: 169.254.255.254

● This also works on servers without Docker

yocalhost

[Figure: network interfaces on a Docker host]

● docker container 1: lo 127.0.0.1, eth0 169.254.14.17

● docker container 2: lo 127.0.0.1, eth0 169.254.14.18

● host (runs haproxy): docker0 169.254.1.1, eth0 10.1.2.3, lo 127.0.0.1, lo:0 169.254.255.254

smartstack.yaml

● port that HAProxy binds to

● mode (TCP/HTTP)

● Timeouts

● Healthcheck URI

Demo: Discovery

Monitoring

Monitoring a PaaS is different

● Things can change frequently

○ Which boxes run which services?

○ What services even exist?

● Traditional "host X runs service Y" checks don't work anymore.

Monitor the invariants

● N copies of a service are running

● Marathon running on X,Y,Z

● All nodes are running mesos-slave, synapse, nerve, docker

● Cron jobs have succeeded recently
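The first invariant, "N copies of a service are running", can be checked without knowing which hosts run which service. A minimal sketch, with hypothetical names and with both inputs given as plain dictionaries rather than fetched from a real scheduler:

```python
from typing import Dict, List


def under_replicated(expected: Dict[str, int], running: Dict[str, int]) -> List[str]:
    """Return the services that are running fewer copies than expected.

    `expected` and `running` map service names to instance counts; a service
    absent from `running` counts as zero copies.
    """
    return sorted(
        service
        for service, count in expected.items()
        if running.get(service, 0) < count
    )
```

A check like this keeps working as services move between hosts, because it only asserts on counts.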

Sensu monitoring

● Decentralized checking

● Client executes checks, puts results on a message queue

● Sensu servers handle results from the queue, route them to email, PagerDuty, JIRA, etc.

We can send our own events

try:
    something_that_might_fail()
except Exception:
    send_failure_event()
else:
    send_success_event()
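A fuller version of this pattern might look like the sketch below. It assumes a Sensu client is listening on its local input socket (port 3030 by default, as far as I know) and accepts a JSON check result with `name`, `status`, and `output` fields. The function names are hypothetical, and the sender is injectable so the logic can be exercised without a running Sensu client.

```python
import json
import socket
from typing import Callable


def build_event(name: str, status: int, output: str) -> bytes:
    """Encode a check result; status 0 means OK and 2 means critical."""
    return json.dumps({"name": name, "status": status, "output": output}).encode("utf-8")


def send_to_local_client(payload: bytes, host: str = "127.0.0.1", port: int = 3030) -> None:
    """Write the payload to the local Sensu client socket (assumed address)."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload)


def run_and_report(
    name: str,
    job: Callable[[], None],
    send: Callable[[bytes], None] = send_to_local_client,
) -> None:
    """Run `job` and report success or failure as an event."""
    try:
        job()
    except Exception as exc:
        send(build_event(name, 2, f"{name} failed: {exc}"))
    else:
        send(build_event(name, 0, f"{name} succeeded"))
```

Wrapping a cron job this way covers the "cron jobs have succeeded recently" style of check: the job reports its own outcome instead of waiting for something else to poll it.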

Lessons Learned

What has PaaSTA taught us?

Interfaces are important

App-Infra boundary

Permissive enough for developers to do their job, strict enough to prevent infrastructure from ballooning

Between infra components

The right abstractions can save you a lot of work if you need to swap components

"Evolution versus Revolution"

Iterative improvements find local optima

Sometimes you need to take bigger risks to get bigger rewards

What's next for PaaSTA?

● It's open source now!

● More polish, docs, examples

● Support more technologies

○ Chronos in-progress

○ Docker Swarm?

○ Kubernetes?

Thank you!

Evan Krall

@[email protected]