Page 1: Herding cats in the Cloud

HERDING CATS IN THE CLOUD: MAINTAINING OPERATIONAL SANITY IN A CLOUDY, DEVOPS WORLD

Dewey Sasser

Consulting Cloud Architect

Aligned Software

Page 2: Herding cats in the Cloud

ABOUT THIS TALK

Public clouds can give developers unprecedented levels of power

“With Great Power Comes Great Responsibility”

You must structure your development and production deployment process to use this power well

How do we do this? Experience from a large deployment

Page 3: Herding cats in the Cloud

ABOUT DEWEY

Distributed Application Developer for 20 years

Doing build/release/software process for about that long

Accidentally doing devops out of self-defense

Wandered into operations about 5 years ago

Built some private cloud for dev

Built some private cloud for prod

Started architecting with public cloud for everything

Page 4: Herding cats in the Cloud

ABOUT THE COMPANY

Company Policy: don’t talk for the company

Therefore, these slides don't mention The Company.

There is no information here that is not otherwise publicly available.

Whoever it is, I don't speak for them

Major Gaming Company, multiple AAA titles

History in MMOs

All in on mobile now

Page 5: Herding cats in the Cloud

ROADMAP

What we Did

What we're Doing

What we might do Next

Page 6: Herding cats in the Cloud

WE'RE COMING FROM...

Traditionally MMOs in colo

Windows (ugh!) based servers

All in cloud now: mobile, cloud, Docker, MongoDB, Phoenix Servers, Chaos Monkey, (...other popular buzzwords)

Page 7: Herding cats in the Cloud

GOALS

100% uptime: players want to play

No more: Patch days, "Down for maintenance"

Profit ( = revenue – cost)

Page 8: Herding cats in the Cloud

SCALE

Hundreds of thousands of dollars in monthly spend

Many hundreds of instances

Around 500TB of monthly transfer

Peak to 12k tps (for a single title)

Around 1 PB of storage

Approximately 5 billion I/Os monthly

Page 9: Herding cats in the Cloud

USAGE/LOAD PATTERN

Traditional SaaS assumes starting small and scaling. Scaling quickly is a problem, but a good problem.

Games are weird

Peak usage is on release day; it tails off after that

You must be able to scale out of the gate. Users who cannot use it on the first day will often never be back!

Page 10: Herding cats in the Cloud

PLATFORMS

Swarm pattern

Pods of services

Python/NGINX

Batch Processing pattern

Vertica

Elastic Map/Reduce

Work Queue (Kafka)

NoSQL (MongoDB – ugh!)

Gaming Platform

CoreOS/Docker

Strong Phoenix Server pattern

Page 11: Herding cats in the Cloud

PROCESS/SOCIAL APPROACH

Must be (people) scalable

Working on 3 new games at any one time

Still supporting old games

Supporting services for the larger company

Don't create a bottleneck

“I'm waiting for a VM”. Bad process. No biscuit.

There are too many controls to get least privilege right!

Validation, not prevention (WHAT???)

Page 12: Herding cats in the Cloud

POLICIES ARE GREAT, BUT...

They change over time

Are hard to get exactly right up front

Always have exceptions

The space of AWS permissions is HUGE. Permutations are deadly.

So...measure what you care about.

What you care about will change over time.

Trust...and verify

Page 13: Herding cats in the Cloud

POWER TO THE PEOPLE (OR DEVELOPERS)

Don't gate productivity on fine points of arbitrary policies

Keep responsibility with dev team

domain expertise

put the pain where the control is

Stuff gets automated!!!

Page 14: Herding cats in the Cloud

APPROACH

Cloud Environment

Multiple accounts (~ 2 dozen right now)

1 central services account

1 account per title

All environments in different VPCs (Dev, QA, Perf, Staging, Prod)

Page 15: Herding cats in the Cloud

DEV TEAMS RESPONSIBLE FOR...

Developing, validating, deploying and running their games

Responding to production issues

PRODUCTION cost control

Page 16: Herding cats in the Cloud

CENTRAL "CLOUD SERVICES" TEAM

“Owns”

Metrics, Monitoring, Alerting

Enables use of central services & good practices

Composable components used by the teams

Native packaging -- make it easy

Manages good practices

Their job is to be cloud experts

But they're not the only ones in the company

LOTS of conversation!

Automates everything non-project specific

New account creation, ...

Page 17: Herding cats in the Cloud

OWNERSHIP/RESPONSIBILITY

Clearly align authority and responsibility.

If a dev is getting up in the middle of the night to fix something, they have to have full power to fix it.

On a related note, that means the teams get approval control over a great deal

Page 18: Herding cats in the Cloud

GREAT, HOW?

Page 19: Herding cats in the Cloud

CRITICAL TOOL: RULES & WORKFLOWS

Custom-developed rules/workflow system

Rules are small, stateless snippets of Python code that trigger workflows

Rules can be company public and extensible by pull request

Workflows are potentially long-running, stateful operations that trigger a list of changes.

Workflows can also be company public, but with tighter controls around changes.

Changes can be reviewed manually or automatically.
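The split between stateless rules and stateful workflows can be sketched as below. This is a minimal illustration, not the custom system the talk describes: the `Change` and `Workflow` classes, the `untagged_instance_rule` rule, the `run_rules` and `auto_review` helpers, and the resource dictionaries are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Change:
    """One proposed change (hypothetical shape): an action on a target."""
    action: str
    target: str


@dataclass
class Workflow:
    """Stateful side: accumulates proposed changes and an approval flag."""
    name: str
    changes: List[Change] = field(default_factory=list)
    approved: bool = False


# A rule is a stateless function: resource in, workflow name (or None) out.
Rule = Callable[[dict], Optional[str]]


def untagged_instance_rule(resource: dict) -> Optional[str]:
    """Trigger the 'flag-untagged' workflow for instances with no Owner tag."""
    is_instance = resource.get("type") == "instance"
    if is_instance and "Owner" not in resource.get("tags", {}):
        return "flag-untagged"
    return None


def run_rules(resources: Iterable[dict], rules: Iterable[Rule]) -> Dict[str, Workflow]:
    """Evaluate every rule against every resource and collect the changes."""
    rules = list(rules)
    workflows: Dict[str, Workflow] = {}
    for resource in resources:
        for rule in rules:
            name = rule(resource)
            if name is not None:
                wf = workflows.setdefault(name, Workflow(name))
                wf.changes.append(Change(action=name, target=resource["id"]))
    return workflows


def auto_review(workflow: Workflow, allowed_actions: Set[str]) -> bool:
    """Automatic review: approve only if every action is on the allow list."""
    workflow.approved = all(c.action in allowed_actions for c in workflow.changes)
    return workflow.approved
```

In this sketch, a workflow that fails `auto_review` would be left unapproved for a person to look at, which mirrors the "reviewed manually or automatically" point.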

Page 20: Herding cats in the Cloud

CRITICAL TOOL: RULES & WORKFLOWS

Runtime is HIGHLY privileged – keep it tight!

This tool can destroy the world – but it actually keeps it running.

(You have everything automated to recreate the world, right?)

Page 21: Herding cats in the Cloud

USER ACCESS CONTROL

Automate user management/creation from source in Git

Define membership rules as the intersection of desired group and account characteristics (MFA, anyone?)

Rules/Workflow enforces MFA. Central team doesn't have to.

Remove your MFA, get demoted to “User”

Page 22: Herding cats in the Cloud

USER ACCESS CONTROL

Don't try for least privilege – you won't get it right and it will be different tomorrow

There are a small number of access levels, and people are sorted into those levels per account

User

ReadOnly (Manager)

Finance

Developer

DevOps

FullAdmin
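The "intersection of desired group and account characteristics" rule, combined with the access levels listed above, can be sketched as follows. The function name `resolve_memberships`, its inputs, and the user names are assumptions made for illustration; the talk does not show how the real rule is written.

```python
from typing import Dict, Set

# Access levels as named on the slide.
LEVELS = ("User", "ReadOnly", "Finance", "Developer", "DevOps", "FullAdmin")


def resolve_memberships(desired: Dict[str, str], mfa_enabled: Set[str]) -> Dict[str, str]:
    """Return each user's effective level for one account.

    `desired` maps user name to the level requested in source control
    (hypothetical format). A user keeps that level only while MFA is
    enabled; otherwise they are demoted to "User".
    """
    effective = {}
    for user, level in desired.items():
        if level not in LEVELS:
            raise ValueError(f"unknown access level: {level}")
        effective[user] = level if user in mfa_enabled else "User"
    return effective
```

Because the result is recomputed from the desired state and the observed MFA state on each run, nobody on the central team has to demote or restore users by hand.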

Page 23: Herding cats in the Cloud

USER ACCESS CONTROL NG

Federation? Yes, but there are issues

SSO? Likewise

We'll probably go to a SAML-based federated MFA gateway

We might go to AD-based access

Page 24: Herding cats in the Cloud

NETWORK ACCESS CONTROL

VPN into the cloud

Bastion hosts

Private VPCs

Shared root keys

Yup, shared.

No user management on individual nodes

Cattle, not cats

Page 25: Herding cats in the Cloud

COST CONTROL

It's a thing.

It's a really BIG THING!

Page 26: Herding cats in the Cloud

COST CONTROL

Tagging policy

Owner (who to go to)

Environment (Dev, Prod, QA, …)

Project (Cost Center – DO NOT USE THIS FOR AUTOMATION!)

Enforce tagging by rules/workflow process

Measure compliance, escalate to GM

Kill off instances that don't comply

With lots of warning

Now tools will give good data

CloudHealth (there are others)
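The tagging policy above (measure compliance, warn, then kill) could be checked with something like the sketch below. The 14-day grace period, the function names, and the plain-dict tag format are assumptions for the example; the slides only say that instances get "lots of warning" before they are killed.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional

# Required tags from the slide.
REQUIRED_TAGS = ("Owner", "Environment", "Project")
# Assumption: the deck does not give the actual length of the warning period.
GRACE_DAYS = 14


def missing_tags(tags: Dict[str, str]) -> List[str]:
    """List required tags that are absent or empty."""
    return [name for name in REQUIRED_TAGS if not tags.get(name)]


def next_action(tags: Dict[str, str], first_flagged: Optional[date], today: date) -> str:
    """Decide what to do with one instance: 'ok', 'warn', or 'terminate'."""
    if not missing_tags(tags):
        return "ok"
    if first_flagged is None or today - first_flagged < timedelta(days=GRACE_DAYS):
        return "warn"
    return "terminate"


def compliance_rate(instances: List[Dict[str, str]]) -> float:
    """Fraction of instances carrying every required tag (for escalation reports)."""
    if not instances:
        return 1.0
    compliant = sum(1 for tags in instances if not missing_tags(tags))
    return compliant / len(instances)
```

The compliance rate is the number that would be reported upward; `next_action` is the per-instance decision a rule could hand to a workflow.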

Page 27: Herding cats in the Cloud

WHAT YOU CARE ABOUT WITH COSTS (AWS SPECIFIC)

Reserved Instances

Go for about 80% of always on – Leave room to optimize

Periodically review it and move RIs

Turn off developer systems overnight – small but significant.

Stay on current generation (instance type and OS)

Better performance/$, results in lower $

Pay attention to traffic – inter AZ as well as outbound.

Compression!

Do cost estimates based on loads – have guidelines
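As a worked example of the "about 80% of always on" guideline for Reserved Instances, here is a small sketch. It assumes that "always on" means the minimum number of instances of a given type running at any sampled hour; that reading, the function name, and the sample counts are illustrative assumptions, not figures from the talk.

```python
from typing import Dict, List


def reserved_instance_targets(
    hourly_counts: Dict[str, List[int]], coverage_percent: int = 80
) -> Dict[str, int]:
    """Suggest how many RIs to hold per instance type.

    `hourly_counts` maps an instance type to the number of running
    instances observed at each sampled hour. The always-on baseline is
    the minimum of those samples; the target is a percentage of it,
    rounded down, which leaves room to optimize later.
    """
    targets = {}
    for instance_type, counts in hourly_counts.items():
        always_on = min(counts) if counts else 0
        targets[instance_type] = always_on * coverage_percent // 100
    return targets


usage = {"m4.large": [40, 55, 70, 42], "c4.xlarge": [3, 9, 5]}
print(reserved_instance_targets(usage))
```

Rerunning a calculation like this on fresh usage data is one way to carry out the periodic review the slide calls for.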

Page 28: Herding cats in the Cloud

ACTUALLY HERDING THE CATS

Devops Working Group

Senior engineers

No managers: if you can't put hands on a keyboard to fix something going wrong, this is not the place for you

Things are brought up, opinions are formed. Don’t attribute to individuals.

Discuss cross-cutting needs

GREAT place for the central cloud team to mine for new work

Page 29: Herding cats in the Cloud

ACTUALLY HERDING THE CATS

Central Cloud Team

Is ½ service organization and ½ cloud owner

Be nice, or the cats will go away and ignore you.

The cats are your scouts and your customers. Listen to them so you know what's important.

Page 30: Herding cats in the Cloud

RESULTS

PROs

Maximizes velocity, agility

Scalable

Can try out different working patterns

CONs

Inconsistent

Have to be careful about responsibilities

You always have some weeds in the garden

You're always trying to keep up with developers

But at least you know it

And you're not in the way

Page 31: Herding cats in the Cloud

RESOURCES

AWS Enterprise Support

Expensive, but good

Cloud based services – lots of options here

OpEx, not CapEx (except for RIs?)

Metrics (Librato)

Cost Exploration (CloudHealth)

Page 32: Herding cats in the Cloud

TOOLS

Automated Rules/Workflow

GitHub Enterprise – it's GitHub that makes your security geeks happy.

Docker

Quay (Private Docker Hub)

Jenkins (for network Cron)

Chef (not much use any more)

Page 33: Herding cats in the Cloud

NEXT STEPS

Cloud Services Liaisons

Send a member of cloud central to each team's sprint planning

"Lunch and Learn"

Goes both ways -- NOT just the cloud central team

More Policy Automation!!!

Page 34: Herding cats in the Cloud

LESSONS LEARNED

Start when you're small – fixing the problem after the fact is much harder

Automate everything, even when you don't “have” to – it makes things easier to change

Have a Central Services Team to deal with cross-cutting concerns

Put the power in the hands of people who can make things better

Page 35: Herding cats in the Cloud

QUESTIONS?

Page 36: Herding cats in the Cloud

PHOTO CREDITS

• https://www.flickr.com/photos/pelican/6180235561

• https://www.youtube.com/watch?v=puijCrETsrY

• https://www.flickr.com/photos/afu007/2398217277

• https://www.flickr.com/photos/jurvetson/5419597546

• https://commons.wikimedia.org/wiki/File:Catch_cats_3.JPG

• https://pixabay.com/en/photos/pet/?cat=industry

• https://commons.wikimedia.org/wiki/File:White_Cat_and_a_mouse.jpg

• https://www.flickr.com/photos/dan4th/2839915202

• https://pixabay.com/en/cat-annoyed-mauzen-teeth-stress-1370024/

• https://et.wikipedia.org/wiki/Pilt:PR_Siriuksen_EeroCurl_ACS_ds_09_24_1.JPG

• https://commons.wikimedia.org/wiki/File:PR_Siriuksen_EeroCurl_ACS_ds_09_24_2.JPG

• https://commons.wikimedia.org/wiki/File:Tunnel_cat_(6414878527).jpg

• https://www.flickr.com/photos/petsadviser-pix/8652859754

• https://commons.wikimedia.org/wiki/File:Antu_mongodb.svg

• https://www.flickr.com/photos/michael-broad/4642745499

• http://maxpixel.freegreatpicture.com/Cat-Animal-Kennel-Cats-Eyes-Cute-Cat-Animals-269047

• http://maxpixel.freegreatpicture.com/Cat-Kitty-Kitten-Cute-Pipe-Curious-Tube-Feline-568593

• https://www.flickr.com/photos/santamonicamtns/16613805934

• http://maxpixel.freegreatpicture.com/Surprise-Kitten-Kittens-Cat-Money-Animals-Pet-602944

• https://www.pexels.com/photo/animals-cat-pets-7792/

• https://commons.wikimedia.org/wiki/File:Cat_into_the_box.jpg

• https://commons.wikimedia.org/wiki/File:White_cat_over_water_2012.jpg

• https://pixabay.com/en/black-cat-reading-white-paper-33843/

• https://pixabay.com/en/photos/hidden/

• https://www.flickr.com/photos/editor/1195653047

• https://en.wikipedia.org/wiki/File:Exponential_Decay_Function.png

• Other Photos by Chris Williams, Dewey Sasser, and Jennifer Moore

All photos found by Google Images marked for commercial reuse, or by personal permission