Artur [email protected]
• Wikia Inc– We are hiring– Community/Bizdev in Germany– Engineers in Poland– http://www.wikia.com/wiki/hiring
• O’Reilly Radar– http://radar.oreilly.com/artur/
The value of operations
• Google• Orkut• Friendster• Myspace
Benefits
• Users trust your brand• They rely on you• They spend more time on your site• Bad operations wastes R&D money
• Fixed amount of time + faster site = more page views
Stepchild of Engineering
• Product development• Engineering• Operations
– Sysadmins?• Why?
Operations Engineering
• It is engineering
• Google terminology
– Site Reliability Engineer
• Sure, there are sysadmins too, people managing NOCs and datacenters
• Provide career growth
Good Engineers
• Detail Oriented• Aspire to be operational engineers• Stubborn• Can steer their inner ADD
– Interrupt driven• Not the same as good developers
Danger signs
• Thinks operations is a path to development engineering
– Fire them
• Want people dedicated to the task
• A good operations engineer should spend some time in development
• A good development engineer MUST spend some time in operations
Debugging
• 9 Rules of debugging
• http://www.debuggingrules.com/Poster_download.html
– Yes, the font is horrible
Rule 1: Understand the system
• Complexity Kills• No excuse• If you write it, you must know it• If you run it, you must know it• If you buy it, you must know it
Rule 3: Quit thinking and look
• "It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."
– Sherlock Holmes, A Scandal in Bohemia
Rule 3: Quit thinking and look
• What do you look at?• The importance of monitoring• Monitoring• Monitoring• Monitoring
My my, confusing term
• Monitoring• Alerting• Trending
Monitoring
• Collects data• Puts into databases• Makes it available for you• Active collection• Passive interaction
Alerting
• Acts on monitoring data• Severe alerts
– Active– Needs action
• Passive alerts– Things that need to be done but not right now
• DO NOT OVER ALERT• DO NOT CRY WOLF
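One way to avoid crying wolf is to page a human only after several consecutive failed checks, and to treat a single failure as a passive alert. The sketch below illustrates that idea; the thresholds, the class name and the function name are assumptions for illustration, not part of any monitoring tool.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them to your own site.
PAGE_AFTER_FAILURES = 3      # consecutive failed checks before paging someone
SLOW_RESPONSE_SECONDS = 2.0  # responses slower than this count as failures


@dataclass
class AlertState:
    """Remembers how many checks in a row have failed."""
    consecutive_failures: int = 0


def classify(state: AlertState, up: bool, response_seconds: float) -> str:
    """Return 'ok', 'passive' (look at it later) or 'severe' (act now)."""
    failed = (not up) or response_seconds > SLOW_RESPONSE_SECONDS
    if not failed:
        state.consecutive_failures = 0
        return "ok"
    state.consecutive_failures += 1
    # A single bad check is only recorded; repeated ones wake a human up.
    if state.consecutive_failures >= PAGE_AFTER_FAILURES:
        return "severe"
    return "passive"
```

Feeding every check result through `classify` keeps one-off blips out of the pager while still escalating a site that stays slow or down.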
Wikia alerting strategy
• When the site is slow• Or down• We send emails and do phone calls• Europe and US West coast• Looking to hire in East Asia• No night time
Trending
• Long term • Capacity planning
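For capacity planning, a rough first step is to fit a straight line through the trend data and see when it crosses the capacity limit. The sketch below assumes roughly linear growth and samples given as (day, usage) pairs; the function name and the units are illustrative only.

```python
def days_until_full(samples, capacity):
    """Fit a least-squares line through (day, usage) samples.

    Returns the number of days after the last sample at which the fitted
    line reaches `capacity`, or None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one day")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None  # no growth, so no projected exhaustion
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    return full_day - samples[-1][0]
```

A disk growing from 100 GB by 10 GB a day, measured over four days, is projected to hit a 200 GB limit seven days after the last sample.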
Monitor Tools
• Nagios• Cacti• MRTG• Hyperic• Cricket• Ganglia
External Monitoring
• Use one, tells you what your clients see every x minutes
• Keynote• Gomez• Websitepulse (cheap - easy - I like
them; no annoying salesforce)
Nagios
• Alerting
• Hassle
• C CGI??
• Doesn’t scale
Hyperic
• Most exciting open source tool• Agent base - self configured• Baseline alerting
Cricket MRTG Cacti
• Impossible to configure• You need to write tools to do it• Especially Cacti
– Somewhat more pleasant than clawing out your eyes
Ganglia
• We love Ganglia
• Automatically graphs everything you want - just works
• Large scale clusters
• Multicast
• Zero config
• RRD
http://ganglia.wikimedia.org/
• 270 hosts• 880 CPU• 2 clusters• 1.2 TB of Memory
http://ganglia.wikimedia.org
Custom Ganglia Gmetrics
• Write your own
• Or learn Unix
gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 --value=`echo 'show processlist' | mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6`
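The pipeline above reports the Time column of the first non-sleeping thread. A variant that reports the largest Time value could look like the sketch below. It assumes the tab-separated batch output of `show processlist` (columns Id, User, Host, db, Command, Time, State, Info) and a `gmetric` binary on the PATH; the function names are hypothetical, and only the gmetric flags already used above appear.

```python
import subprocess
from typing import List


def oldest_query_seconds(processlist_text: str) -> int:
    """Largest Time value among active, non-system threads (0 if none)."""
    times = []
    for line in processlist_text.splitlines()[1:]:  # skip the header row
        fields = line.split("\t")
        if len(fields) < 6:
            continue
        user, command, seconds = fields[1], fields[4], fields[5]
        if command == "Sleep" or user == "system user":
            continue
        if seconds.isdigit():
            times.append(int(seconds))
    return max(times, default=0)


def gmetric_command(seconds: int) -> List[str]:
    """Build the same gmetric invocation as the shell one-liner."""
    return ["gmetric", "--name=Oldest query", "--type=int32",
            "--units=sec", "--dmax=65", f"--value={seconds}"]


def report(processlist_text: str) -> None:
    """Send the metric; assumes gmetric is installed on this host."""
    subprocess.run(gmetric_command(oldest_query_seconds(processlist_text)),
                   check=True)
```

Run from cron once a minute, this fits the `--dmax=65` setting, which lets the metric expire if the reporter stops.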
Something is wrong
• Don’t worry, data warehouse
tcpdump / Wireshark
• If you suspect the network
• Don’t just suspect
• LOOK AT IT
• tcpdump / Wireshark will tell you
– If your packets are lost, delayed or corrupted
– If your windowing is wrong
Rule 4: Divide and Conquer
• Look at the problems in turn
• Split between people
• Go in the order you suspect is the most likely
Rule 5: Change one thing at a time
• I cannot stress this enough
• IF YOU DO NOT THEN YOU HAVE FAILED TO IDENTIFY THE PROBLEM
Rule 6: Keep an audit trail
• You might be making things worse
• Good for the root cause analysis
• Have your shell log all commands
– Good practice anyway
• Version control
Rule 9: If you didn’t fix it, it ain’t fixed
• You must do something to fix a problem• Or it will bite you again• And again• And again• They don’t just appear and disappear• Except BGP route convergence :)
Process
• You need a little• Don’t worry
Don’t forget
Complexity kills
• Design against it
• Reuse components
• Define standards
• Have a few images that all machines look like - reimage machines every now and then for the heck of it
– EC2 forces you to do this
MTBF: Mean Time Between Failures
• Actually mostly irrelevant
• Dealing with failure is more important
• Target the right uptime
– Complexity scales exponentially with required uptime
• Don’t kid yourself, you don’t need 5 nines
MTTR: Mean Time To Recovery
• Important
• No one cares if you fail once a minute
– If you recover in 50 ms
• If you are down 1 minute a week, you are still going to hit 4 nines (99.99%)
• Failures happen, plan how to deal with them
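The four nines figure can be checked with a few lines of arithmetic. The sketch below converts weekly downtime into an availability percentage and a count of nines; the function names are illustrative.

```python
import math

MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def availability_percent(downtime_minutes_per_week: float) -> float:
    """Share of the week the site was up, as a percentage."""
    return 100.0 * (1.0 - downtime_minutes_per_week / MINUTES_PER_WEEK)


def nines(downtime_minutes_per_week: float) -> int:
    """Number of leading nines, e.g. 99.99% availability gives 4."""
    if downtime_minutes_per_week <= 0:
        raise ValueError("zero downtime has no finite number of nines")
    unavailable = downtime_minutes_per_week / MINUTES_PER_WEEK
    return math.floor(-math.log10(unavailable))


def weekly_budget_minutes(target_nines: int) -> float:
    """Downtime per week that a given number of nines still allows."""
    return MINUTES_PER_WEEK * 10.0 ** -target_nines
```

One minute of downtime in a 10080-minute week is about 99.9901% availability, so four nines holds, if only just. Five nines leaves roughly 0.1 minutes, about six seconds, per week.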
Problem found
• If it is critical, start a phone conversation
• Use IRC to communicate technical data
• One person liaises with non-technical staff
• One person specifically in command
• Sleep scheduling (audit log important)
Post crisis
• Root cause analysis – Just find out what went wrong– And how to avoid it– Or fix it faster next time if you can’t
• Keep track of your uptime
Automation
• All machines are created equal• Seriously• If you manually make changes• You are wrong
– Unless you know what you are doing
Best practices
• Version control
• Gold images
• Centralised authentication
• Time sync (NTP)
• Central logging
• (All of this applies to virtual machines too!)
cfengine
• Standard automation tool• Written in C• Not much support• Very good• Very annoying
control:
   site           = ( mysite )
   domain         = ( mysite.country )
   sysadm         = ( mark )
   netmask        = ( 255.255.255.0 )
   actionsequence = ( mountall mountinfo addmounts mountall links )
   mountpattern   = ( /$(site)/$(host) )
   homepattern    = ( u? )
Puppet
• New hip kid on the block• Written in ruby• Better support?• Much nicer syntax• Easier to extend
define yumrepo (enabled = true) {
    configfile { "/etc/yum.repos.d/$name.repo":
        mode   => 644,
        source => "/yum/repos/$name.repo",
        ensure => $enabled ? { true => file, default => absent }
    }
}
cobbler
• Automatic PXE Installer– Uses kickstart files
• Red Hat Enterprise Linux
• CentOS
• Fedora
• Some support for Debian
cobbler
cobbler system add --name=xen8 --mac=00:19:B9:EE:6D:0A --ip=10.10.30.208 --profile=Centos-5-x86_64 --kopts='ksdevice=00:19:B9:EE:6D:0A console=ttyS1,57600 console=tty0'
koan
• Client install tool– Xen– Or OS re-image
koan --server=10.10.30.205 --virt --profile=virt_fc6 --virt-name=otrs
Your datacenter
• Keep it tidy– Label things, keep cables as short as possible– Have a switch in each rack
• If you are small without dedicated DC staff you need– Remote control power switches– Remote console!
Virtualization
• Please use it• Managing becomes much easier• Power consumption• Need a new test box
– The requestor can have it in minutes
Power consumption
• Maybe not as important in Europe
• 8 core machines are more efficient than 1 core
• But memcache uses 1 core and all RAM
• Get more RAM and virtualise
Our network admin boxes
• 1 Xen CPU for Vyatta• 1 Xen CPU for LVS• 1 Xen CPU for Squid - Carp• 1 Xen CPU for Squid• 1 Xen CPU for Monitoring• 1 Xen CPU for network tasks
• We can have more of these and a loss of one affects us less
Vyatta
• Open source router
– Really like it
– No need to use Cisco
LVS
• Linux Virtual Server
• Low level load balancer
• HA
• Fast
• Doesn’t inspire people to put things in the only place that is hard to scale
Squid Carp
• Squids configured to hash the urls and send them to specific backend
• Very little configuration done
• Logging over UDP - no disk IO
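The idea of hashing each URL to a specific backend can be sketched with highest-random-weight (rendezvous) hashing, the general family CARP belongs to. This is not Squid's actual CARP hash function; the SHA-1 scoring, the function name and the backend names are assumptions for illustration.

```python
import hashlib
from typing import Sequence


def pick_backend(url: str, backends: Sequence[str]) -> str:
    """Choose the backend with the highest hash score for this URL.

    Every frontend computes the same answer without shared state, and
    removing one backend only remaps the URLs that backend was serving.
    """
    def score(backend: str) -> int:
        digest = hashlib.sha1(f"{backend}|{url}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)
```

Because each URL always lands on the same backend, each object is cached once across the cluster rather than once per cache.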
Squid
• As a reverse web accelerator
• 90% of our hits served from RAM in less than 1 ms
• Same as Wikipedia
• We only use RAM cache (unlike Wikipedia)
• Cached per user
• If not cacheable - cache for a second to reduce backend load
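Caching an otherwise uncacheable page for one second means that, however many requests arrive, the backend renders that page at most about once per second. A minimal in-process sketch of that idea follows; it is not Squid configuration, and the class name and the injectable clock are assumptions for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class MicroCache:
    """Keep each rendered page for a very short time (one second by default)."""

    def __init__(self, ttl_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, render: Callable[[], str]) -> str:
        """Return the cached page, calling render() only when it has expired."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = render()
        self._entries[key] = (now, value)
        return value
```

With a thousand requests a second for one page, `render` still runs roughly once a second for that key.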
App servers
• 1 xen cpu for memcache ( 5 GB Ram)• 1 xen cpu for squid ( 5GB Ram )• 6 xen cpus for apache (6 GB Ram )
• More power efficient, less affected by loss
• Applications can’t affect each other
Databases
• Keep developers on short leash• Report bad queries• Fear object relational mappers
Outsourcing
• As much as possible
• The younger you are as a company the less risk
– When you have no users, you have no value
• VCs don’t like having their money go into Capex
What I want from Vendors
• They do what they tell me• They do what I tell them
• No annoying upsells, no premium services
– I know more about what you are selling than you
Services we use
• Amazon EC2 and S3• Panther-Express
Panther Express
• Fantastic Content Distribution Network
• Cheap, simple price list
– Take note, Akamai
• Cut delivery time to Europe by 70%
• We let our images be cached 1 second to reduce load
EC2 and S3
• We save all our binlogs to S3
• We save database dumps to S3
• We have monitors running from EC2
• We plan to build a data warehouse cluster on EC2
EC2 Requires Automation
• Machine is blank when you bring it up
• Download database dump from S3 and replicate up - automatically
• Use Puppet
• Amazon saves you hardware headaches
– But complexity is still a problem
Thank you