HERDING CATS IN THE CLOUD: MAINTAINING OPERATIONAL SANITY IN A CLOUDY, DEVOPS WORLD
Dewey Sasser
Consulting Cloud Architect
Aligned Software
ABOUT THIS TALK
Public Clouds can give developers unprecedented levels
of power
“With Great Power Comes Great Responsibility”
You must structure your development and production
deployment process to use this power well
How do we do this? Experience from a large deployment
ABOUT DEWEY
Distributed Application Developer for 20 years
Doing build/release/software process for about that long
Accidentally doing devops out of self-defense
Wandered into operations about 5 years ago
Built some private cloud for dev
Built some private cloud for prod
Started architecting with public cloud for everything
ABOUT THE COMPANY
Company Policy: don’t talk for the company
Therefore, these slides don't mention The Company.
There is no information here that is not otherwise publicly available.
Whoever it is, I don't speak for them
Major Gaming Company, multiple AAA titles
History in MMOs
All in on mobile now
ROADMAP
What we Did
What we're Doing
What we might do Next
WE'RE COMING FROM...
Traditionally MMOs in colo
Windows (ugh!) based servers
All in cloud now: mobile, cloud, Docker, MongoDB, Phoenix
Servers, Chaos Monkey, (...other popular buzzwords)
GOALS
100% uptime: players want to play
No more: Patch days, "Down for maintenance"
Profit ( = revenue – cost)
SCALE
$100ks of monthly spend
Many hundreds of instances
Around 500TB of monthly transfer
Peak to 12k tps (for a single title)
Around 1 PB of storage
Approximately 5 billion I/Os monthly
USAGE/LOAD PATTERN
Traditional SaaS assumes starting small and scaling up. Scaling
quickly is a problem, but a good problem.
Games are weird
Peak usage is release day, it tails off after that
You must be able to scale out of the gate. Users that cannot use
it the first day will often never be back!
PLATFORMS
Swarm pattern
Pods of services
Python/NGINX
Batch Processing pattern
Vertica
Elastic Map/Reduce
Work Queue (Kafka)
NoSQL (MongoDB – ugh!)
Gaming Platform
CoreOS/Docker
Strong Phoenix Server pattern
PROCESS/SOCIAL APPROACH
Must be (people) scalable
Working on 3 new games at any one time
Still supporting old games
Supporting services for the larger company
Don't create a bottleneck
“I'm waiting for a VM”. Bad process. No biscuit.
There are too many controls to get least privilege right!
Validation, not prevention (WHAT???)
POLICIES ARE GREAT, BUT...
They change over time
Are hard to get exactly right up front
Always have exceptions
The space of AWS permissions is HUGE. Permutations are deadly.
So...measure what you care about.
What you care about will change over time.
Trust...and verify
POWER TO THE PEOPLE (OR DEVELOPERS)
Don't gate productivity on fine points of arbitrary policies
Keep responsibility with dev team
domain expertise
put the pain where the control is
Stuff gets automated!!!
APPROACH
Cloud Environment
Multiple accounts (~ 2 dozen right now)
1 central services account
1 account per title
All environments in different VPCs (Dev, QA, Perf, Staging, Prod)
DEV TEAMS RESPONSIBLE FOR...
Developing, validating, deploying and running their games
Responding to production issues
PRODUCTION cost control
CENTRAL "CLOUD SERVICES" TEAM
“Owns”
Metrics, Monitoring, Alerting
Enables use of central services & good practices
Composable components used by the teams
Native packaging -- make it easy
Manages good practices
Their job is to be cloud experts
But they're not the only ones in the company
LOTS of conversation!
Automates everything non-project specific
New account creation, ...
OWNERSHIP/RESPONSIBILITY
Clearly align authority and responsibility.
If a Dev is getting up in the middle of the night to fix
something, they have to have full power to fix it.
On a related note, that means the teams get approval
control over a great deal
GREAT, HOW?
CRITICAL TOOL: RULES & WORKFLOWS
Custom developed rules/workflow system
Rules are small, stateless snippets of Python code that
trigger workflows
But can be company public and extensible by pull request
Workflows are potentially long-running, stateful
operations that trigger a list of changes.
Can also be company public, but tighter controls around
changes.
Changes can be reviewed manually or automatically.
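The rule/workflow split described above can be sketched roughly as follows. This is a minimal illustration, not the actual system: the class names, the `untagged_instance_rule` rule, the workflow name "notify-missing-owner" and the dictionary shape used for resources are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Change:
    """One proposed change; it is reviewed before anything is applied."""
    description: str
    approved: bool = False


@dataclass
class Workflow:
    """Stateful, potentially long-running operation that collects changes."""
    name: str
    changes: List[Change] = field(default_factory=list)

    def propose(self, description: str) -> Change:
        change = Change(description)
        self.changes.append(change)
        return change


# A rule is a small stateless function: resource in, workflow name (or None) out.
Rule = Callable[[dict], Optional[str]]


def untagged_instance_rule(resource: dict) -> Optional[str]:
    """Hypothetical rule: fire when an instance has no Owner tag."""
    if resource.get("type") == "instance" and "Owner" not in resource.get("tags", {}):
        return "notify-missing-owner"
    return None


def evaluate(rules: List[Rule], resources: List[dict]) -> List[Workflow]:
    """Run every rule against every resource and start the workflows they name."""
    started = []
    for resource in resources:
        for rule in rules:
            workflow_name = rule(resource)
            if workflow_name is not None:
                workflow = Workflow(workflow_name)
                workflow.propose(f"{workflow_name}: {resource['id']}")
                started.append(workflow)
    return started


def auto_review(workflow: Workflow, approve: Callable[[Change], bool]) -> List[Change]:
    """Automatic review: mark each change and return only the approved ones."""
    for change in workflow.changes:
        change.approved = approve(change)
    return [change for change in workflow.changes if change.approved]
```

Keeping rules stateless is what makes them cheap to accept by pull request: a rule can only name a workflow, while the workflow, which is held to tighter change control, is the only place that proposes changes.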
CRITICAL TOOL: RULES & WORKFLOWS
Runtime is HIGHLY privileged – keep it tight!
This tool can destroy the world – but it
actually keeps it running.
(you have everything automated to recreate the
world, right?)
USER ACCESS CONTROL
Automate user management/creation from source in GIT
Define membership rules as intersection of desired group and
account characteristics (MFA anyone?)
Rules/Workflow enforces MFA. Central team doesn't have to
Remove your MFA, get demoted to “User”
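The "intersection" idea can be shown with plain sets. This is a sketch under assumptions: the function names and the input shapes (desired groups as they might be read from Git, plus the set of users with MFA enabled) are hypothetical, and only the demotion to "User" comes from the slide.

```python
from typing import Dict, Set


def effective_members(desired: Set[str], mfa_enabled: Set[str]) -> Set[str]:
    """A user is in a group only if Git says so AND the account has MFA."""
    return desired & mfa_enabled


def assign_groups(desired_groups: Dict[str, Set[str]],
                  mfa_enabled: Set[str]) -> Dict[str, Set[str]]:
    """Compute actual group membership; anyone without MFA lands in "User"."""
    result = {
        group: effective_members(users, mfa_enabled)
        for group, users in desired_groups.items()
    }
    all_users = set().union(*desired_groups.values())
    demoted = all_users - mfa_enabled
    result["User"] = result.get("User", set()) | demoted
    return result
```

Because membership is recomputed from the two inputs on every run, nobody on the central team has to chase individuals: removing MFA changes one input and the demotion follows from it.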
USER ACCESS CONTROL
Don't try for least privilege – you won't get it right and it will be different tomorrow
There are a small number of access levels and people are sorted into those levels per
account
User
ReadOnly (Manager)
Finance
Developer
DevOps
FullAdmin
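A small, fixed set of levels assigned per account might be modeled like this. The level names come from the list above; the `Assignments` table, the `level_for` helper and the choice to default unknown users to "User" are assumptions made for illustration.

```python
from enum import Enum
from typing import Dict, Tuple


class AccessLevel(Enum):
    USER = "User"
    READ_ONLY = "ReadOnly"
    FINANCE = "Finance"
    DEVELOPER = "Developer"
    DEVOPS = "DevOps"
    FULL_ADMIN = "FullAdmin"


# Hypothetical table keyed by (account, user), e.g. loaded from a file in Git.
Assignments = Dict[Tuple[str, str], AccessLevel]


def level_for(assignments: Assignments, account: str, user: str) -> AccessLevel:
    """Look up a user's level in one account; assumed default is USER."""
    return assignments.get((account, user), AccessLevel.USER)
```

The same person can hold different levels in different accounts, for example DevOps in their own title's account and only User elsewhere.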
USER ACCESS CONTROL NG
Federation? Yes, but there are issues
SSO? Likewise
We'll probably go to a SAML based federated MFA
gateway
We might go to AD based access
NETWORK ACCESS CONTROL
VPN into the cloud
Bastion hosts
Private VPCs
Shared root keys
Yup, shared.
No user management on individual nodes
Cattle, not cats
COST CONTROL
It's a thing.
It's a really BIG THING!
COST CONTROL
Tagging policy
Owner (who to go to)
Environment (Dev, Prod, QA, …)
Project (Cost Center – DO NOT USE THIS FOR AUTOMATION!)
Enforce tagging by rules/workflow process
Measure compliance, escalate to GM
Kill off instances that don't comply
With lots of warning
Now tools will give good data
CloudHealth (there are others)
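A tagging check of the kind described above could look roughly like this. The three required tag names are from the slide; the 14-day warning window, the function names and the "ok" / "warn" / "terminate" action strings are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

REQUIRED_TAGS = ("Owner", "Environment", "Project")
GRACE_PERIOD = timedelta(days=14)  # hypothetical "lots of warning" window


def missing_tags(tags: Dict[str, str]) -> List[str]:
    """Return the required tags that are absent or empty."""
    return [name for name in REQUIRED_TAGS if not tags.get(name)]


def next_action(tags: Dict[str, str],
                first_flagged: Optional[datetime],
                now: datetime) -> str:
    """Decide what to do with one instance: ok, warn, or terminate."""
    if not missing_tags(tags):
        return "ok"
    if first_flagged is None or now - first_flagged < GRACE_PERIOD:
        return "warn"
    return "terminate"


def compliance_rate(all_tags: List[Dict[str, str]]) -> float:
    """Fraction of instances carrying every required tag (the number to report)."""
    if not all_tags:
        return 1.0
    compliant = sum(1 for tags in all_tags if not missing_tags(tags))
    return compliant / len(all_tags)
```

The compliance rate is the figure that would be escalated; termination only happens once an instance has stayed non-compliant for the whole warning window.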
WHAT YOU CARE ABOUT WITH COSTS (AWS SPECIFIC)
Reserved Instances
Go for about 80% of always on – Leave room to optimize
Periodically review it and move RIs
Turn off developer systems overnight – small but significant.
Stay on current generation (instance type and OS)
Better performance/$, results in lower $
Pay attention to traffic – inter AZ as well as outbound.
Compression!
Do cost estimates based on loads – have guidelines
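Two of the guidelines above reduce to simple arithmetic. The sketch below is only a back-of-the-envelope aid; the function names are hypothetical, and the only figure taken from the slide is the 80% coverage target.

```python
def reserved_instance_target(always_on: int, coverage_percent: int = 80) -> int:
    """Number of always-on instances to cover with RIs, rounded down."""
    return always_on * coverage_percent // 100


def off_hours_saving_fraction(hours_off_per_day: float,
                              days_per_week: int = 7) -> float:
    """Fraction of weekly instance-hours saved by stopping dev systems."""
    return (hours_off_per_day * days_per_week) / (24 * 7)
```

For example, with 100 always-on instances the target is 80 reserved, leaving 20 on demand as room to optimize, and a developer system stopped for 12 hours every night runs for half the hours it otherwise would.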
ACTUALLY HERDING THE CATS
Devops Working Group
Senior engineers
No managers: If you can't put hands on a keyboard to fix something going
wrong, this is not the place for you
Things are brought up, opinions are formed. Don’t attribute to individuals.
Discuss cross-cutting needs
GREAT place for the central cloud team to mine for new work
ACTUALLY HERDING THE CATS
Central Cloud Team
Is ½ service organization and ½ cloud owner
Be nice, or the cats will go away and ignore you.
The cats are your scouts and your customers. Listen to them
so you know what's important.
RESULTS
PROs
Maximizes velocity, agility
Scalable
Can try out different working
patterns
CONs
Inconsistent
Have to be careful about
responsibilities
You always have some weeds in
the garden
You're always trying to keep up
with developers
But at least you know it
And you're not in the way
RESOURCES
AWS Enterprise Support
Expensive, but good
Cloud based services – lots of options here
OpEx, not CapEx (except for RIs?)
Metrics (Librato)
Cost Exploration (CloudHealth)
TOOLS
Automated Rules/Workflow
Github Enterprise – it's Github that makes your security geeks happy.
Docker
Quay (Private Docker Hub)
Jenkins (for network Cron)
Chef (not much use any more)
NEXT STEPS
Cloud Services Liaisons
Send a member of cloud central to each team's sprint planning
"Lunch and Learn"
Goes both ways -- NOT just the cloud central team
More Policy Automation!!!
LESSONS LEARNED
Start when you're small – fixing the problem
after the fact is much harder
Automate everything, even when you don't “have”
to – it makes things easier to change
Have a Central Services Team to deal with cross-
cutting concerns
Put the power in the hands of people who can
make things better
QUESTIONS?
PHOTO CREDITS
• https://www.flickr.com/photos/pelican/6180235561
• https://www.youtube.com/watch?v=puijCrETsrY
• https://www.flickr.com/photos/afu007/2398217277
• https://www.flickr.com/photos/jurvetson/5419597546
• https://commons.wikimedia.org/wiki/File:Catch_cats_3.JPG
• https://pixabay.com/en/photos/pet/?cat=industry
• https://commons.wikimedia.org/wiki/File:White_Cat_and_a_mouse.jpg
• https://www.flickr.com/photos/dan4th/2839915202
• https://pixabay.com/en/cat-annoyed-mauzen-teeth-stress-1370024/
• https://et.wikipedia.org/wiki/Pilt:PR_Siriuksen_EeroCurl_ACS_ds_09_24_1.JPG
• https://commons.wikimedia.org/wiki/File:PR_Siriuksen_EeroCurl_ACS_ds_09_24_2.JPG
• https://commons.wikimedia.org/wiki/File:Tunnel_cat_(6414878527).jpg
• https://www.flickr.com/photos/petsadviser-pix/8652859754
• https://commons.wikimedia.org/wiki/File:Antu_mongodb.svg
• https://www.flickr.com/photos/michael-broad/4642745499
• http://maxpixel.freegreatpicture.com/Cat-Animal-Kennel-Cats-Eyes-Cute-Cat-Animals-269047
• http://maxpixel.freegreatpicture.com/Cat-Kitty-Kitten-Cute-Pipe-Curious-Tube-Feline-568593
• https://www.flickr.com/photos/santamonicamtns/16613805934
• http://maxpixel.freegreatpicture.com/Surprise-Kitten-Kittens-Cat-Money-Animals-Pet-602944
• https://www.pexels.com/photo/animals-cat-pets-7792/
• https://commons.wikimedia.org/wiki/File:Cat_into_the_box.jpg
• https://commons.wikimedia.org/wiki/File:White_cat_over_water_2012.jpg
• https://pixabay.com/en/black-cat-reading-white-paper-33843/
• https://pixabay.com/en/photos/hidden/
• https://www.flickr.com/photos/editor/1195653047
• https://en.wikipedia.org/wiki/File:Exponential_Decay_Function.png
• Other Photos by Chris Williams, Dewey Sasser, and Jennifer Moore
All photos found by Google Images marked for commercial reuse, or by personal permission