Kubernetes to Scale

michele.orsi@lastminute.com @micheleorsi

GDG Cloud - London, 11 January 2017

Started with a monolith ...

https://www.flickr.com/photos/southtopia/5702790189

https://www.pexels.com/photo/gray-pebbles-with-green-grass-51168/

... broken into microservices

Micro-problems at scale

● alignment

● real pipelines

● infrastructure

● resilience

● monitoring

● constraints

A year-long endeavour

● build a new, modern infrastructure

● migrate the search (flight/hotel) product there

... without:

● impacting the business

● throwing away our whole datacenter

How we did that: technology

● company framework

● docker

● kubernetes

How we did that: team/people

https://www.pexels.com/photo/blue-lego-toy-beside-orange-and-white-lego-toy-standing-during-daytime-105822/

Kubernetes: our architecture

[Diagram: two clusters, "production" and "nonproduction". The production cluster shows APP1-PRODUCTION, APP2-PRODUCTION and APP3-PRODUCTION; the nonproduction cluster shows APP1-PREVIEW, APP1-DEVELOPMENT, APP1-QA and APP1-STRESSTEST.]
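The split between the two clusters can be sketched as a small lookup. This is only an illustration: the environment names come from the diagram, while the function `placement`, the lowercase `app-environment` naming scheme and the idea that each box is a namespace are assumptions.

```python
# Hypothetical helper: environment names are taken from the diagram,
# the naming scheme and the namespace-per-box reading are assumptions.
NONPRODUCTION_ENVS = {"preview", "development", "qa", "stresstest"}


def placement(app, env):
    """Return (cluster, namespace) for one application environment."""
    env = env.lower()
    if env == "production":
        cluster = "production"
    elif env in NONPRODUCTION_ENVS:
        cluster = "nonproduction"
    else:
        raise ValueError(f"unknown environment: {env}")
    return cluster, f"{app.lower()}-{env}"
```

For example, `placement("APP1", "QA")` returns `("nonproduction", "app1-qa")`.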

Kubernetes: our architecture

[Diagram, built up over three slides: in the production cluster, APP1-PRODUCTION contains a deployment, its replica-set and three pods (POD1, POD2, POD3), plus a secret and a configmap, and an ingress for app1-production.prd.lmn.intra.]
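A minimal sketch of the objects in this diagram, written as Python functions that build manifests which could be dumped to JSON for `kubectl apply -f`. Everything not on the slide is an assumption: the image name, port 8080, the `-config` and `-secret` object names, and the Service that the Ingress points at. The API versions are the current ones (`apps/v1`, `networking.k8s.io/v1`), not necessarily those in use at the time of the talk.

```python
import json


def deployment(name, image, replicas=3):
    """Deployment that owns a replica-set running `replicas` pods."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "application",
                        "image": image,  # hypothetical image reference
                        "ports": [{"containerPort": 8080}],  # assumed port
                        # Configuration injected from the configmap and secret
                        # (object names are assumptions).
                        "envFrom": [
                            {"configMapRef": {"name": f"{name}-config"}},
                            {"secretRef": {"name": f"{name}-secret"}},
                        ],
                    }]
                },
            },
        },
    }


def ingress(name, host, service_port=8080):
    """Ingress routing one host name to a Service assumed to share `name`."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": name,
                                            "port": {"number": service_port}}},
                }]},
            }]
        },
    }


if __name__ == "__main__":
    print(json.dumps(deployment("app1", "registry.example/app1:2"), indent=2))
    print(json.dumps(ingress("app1", "app1-production.prd.lmn.intra"), indent=2))
```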

Kubernetes: our architecture

[Diagram: an F5 load balancer, three nginx-ingress-ctrl instances listening on port 80 inside the cluster, and the APP1-PRODUCTION pods with IPs 10.0.0.1 to 10.0.0.6.]

Kubernetes: our architecture

[Diagram: a POD in the production cluster containing the application, collectd and fluentd.]

Self-healing: our choice for resilience

/liveness:

● when tomcat container is up

● when “active/max” threads < threshold

/readiness:

● all the startup jobs have run

● no termination request has been received

.. ongoing never-ending research ..
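The two checks can be sketched as plain logic, independent of any web framework. This is an illustration, not the code behind the talk: the class `HealthState`, its field names and the 0.9 threshold are assumptions. It returns HTTP status codes because Kubernetes HTTP probes treat any status from 200 to 399 as success.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    """Hypothetical holder for the signals named on the slide."""
    container_up: bool = True
    active_threads: int = 0
    max_threads: int = 200
    startup_jobs_done: bool = False
    termination_requested: bool = False
    thread_ratio_threshold: float = 0.9  # assumed value, not from the talk

    def liveness(self):
        """200 while the container is up and active/max threads is below the threshold."""
        ratio = self.active_threads / self.max_threads
        ok = self.container_up and ratio < self.thread_ratio_threshold
        return 200 if ok else 503

    def readiness(self):
        """200 once startup jobs have run and no termination was requested."""
        ok = self.startup_jobs_done and not self.termination_requested
        return 200 if ok else 503
```

A freshly started pod would be live but not yet ready; once it receives a termination request it stops being ready, so it is taken out of rotation before it shuts down.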

Kubernetes: what’s left outside?

● datastores

● distributed caches (early 2017)

● distributed locking

● pub-sub/queues

● logs and metrics storage

When can you test with production traffic?

● zero downtime during rollout

● monitoring in place

● alerting

● centralized logging

● legacy infrastructure to the rescue in case of problem
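For the "zero downtime during rollout" item, one common approach in Kubernetes is a rolling update that never removes an old pod before a new one reports ready. The sketch below builds such a Deployment fragment. The values (`maxUnavailable: 0`, `maxSurge: 1`, a 5 second probe period, port 8080) are assumptions, not settings from the talk.

```python
def zero_downtime_patch(readiness_path="/readiness", port=8080):
    """Deployment fragment: add capacity first, remove old pods only when new ones are ready."""
    return {
        "spec": {
            "strategy": {
                "type": "RollingUpdate",
                # Never drop below the desired replica count during a rollout.
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {"spec": {"containers": [{
                "name": "application",
                # Traffic reaches a new pod only after this probe succeeds.
                "readinessProbe": {
                    "httpGet": {"path": readiness_path, "port": port},
                    "periodSeconds": 5,
                },
            }]}},
        }
    }
```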

... failure ... at all different levels ..

https://www.flickr.com/photos/ghost_of_kuji/2763674926

Main problems

● configuration

● infrastructure

● tools

● manual mistakes

● (external) scalability

There’s light .. at the end

https://www.pexels.com/photo/grayscale-photography-of-person-at-the-end-of-tunnel-211816/

Pipeline: a huge step forward

microservice = factory.newDeployRequest().withArtifact("com.lastminute.application1", 2)

lmn_deployCanaryStrategy(microservice, "qa")

lmn_deployStableStrategy(microservice, "preview")

lmn_deployCanaryStrategy(microservice, "production")
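The slides do not show what `lmn_deployCanaryStrategy` does internally. As a rough idea of the generic canary pattern the name suggests, here is a sketch in Python: release to a small number of instances first, check health, and continue only if the check passes. The function `deploy_canary` and its callbacks are hypothetical.

```python
from typing import Callable


def deploy_canary(total_instances: int,
                  deploy: Callable[[int], None],
                  healthy: Callable[[], bool],
                  rollback: Callable[[], None],
                  canary_instances: int = 1) -> bool:
    """Generic canary rollout; returns True when the full rollout succeeded."""
    # Step 1: expose the new version on a small slice only.
    deploy(canary_instances)
    if not healthy():
        rollback()
        return False
    # Step 2: the canary looks fine, update the remaining instances.
    deploy(total_instances - canary_instances)
    if not healthy():
        rollback()
        return False
    return True
```

A "stable" strategy, by contrast, would presumably skip step 1 and update all instances in one go.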

Monitoring: grafana/graphite/nagios

[Diagram: a POD of APP1-PRODUCTION in the cluster runs the application with collectd; also shown are the pipeline, graphite, Grafana and nagios.]

icons from http://www.flaticon.com
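To give an idea of what reaches graphite, here is a sketch of its plaintext protocol: one line per data point, `path value timestamp`, sent over TCP (port 2003 by default). In the setup shown, collectd does the sending; the functions below and the metric path in the example are only illustrative assumptions.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point for Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metric(host, path, value, port=2003, timeout=2.0):
    """Send a single data point to a Graphite (carbon) listener over TCP."""
    line = graphite_line(path, value)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))
```

For example, `graphite_line("app1.production.pod1.threads.active", 42, 1484092800)` produces the line `app1.production.pod1.threads.active 42 1484092800`.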

... benefits

● lead and migration time

● resilience

● root cause analysis

● speed of deployment

● instant scaling

Give me the numbers!

● 36 bare-metal nodes (only for production cluster)

● 5100 req/sec in the new cluster

● 2M metrics/minute flows

● 35 micro-services migrated in 5 months

○ 3 new micro-services migrated per week

○ 10 minutes to create a new environment

● 11 min to roll out a new version with 55 instances

○ whole pipeline runs in 16 min
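A few derived figures help put these numbers in perspective. They are plain averages computed from the values above, under the simplifying assumptions that instances are updated one after another and that traffic is spread evenly across nodes; neither is stated in the talk.

```python
# Inputs taken from the slide.
ROLLOUT_MINUTES = 11
INSTANCES = 55
METRICS_PER_MINUTE = 2_000_000
REQUESTS_PER_SECOND = 5100
NODES = 36

# Average rollout time per instance, in seconds (assumes a sequential rollout).
seconds_per_instance = ROLLOUT_MINUTES * 60 / INSTANCES        # 12.0

# Metric data points per second, rounded down.
metrics_per_second = METRICS_PER_MINUTE // 60                   # 33333

# Average request rate per node (assumes an even spread across nodes).
requests_per_node = round(REQUESTS_PER_SECOND / NODES, 1)       # 141.7
```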

Yes, we’re hiring!

THANKS

www.lastminutegroup.com
