Web 2.0 Performance and Reliability: How to Run Large Web Apps

Artur [email protected]

• Wikia Inc
  – We are hiring
  – Community/Bizdev in Germany
  – Engineers in Poland
  – http://www.wikia.com/wiki/hiring

• O’Reilly Radar
  – http://radar.oreilly.com/artur/

The value of operations

• Google
• Orkut
• Friendster
• Myspace

Benefits

• Users trust your brand
• They rely on you
• They spend more time on your site
• Bad operations waste R&D money
• Fixed amount of time + faster site = more page views

Stepchild of Engineering

• Product development
• Engineering
• Operations
  – Sysadmins?
• Why?

Operations Engineering

• It is engineering
• Google terminology
  – Site Reliability Engineer
• Sure, there are sysadmins too, people managing NOCs and datacenters
• Provide career growth

Good Engineers

• Detail oriented
• Aspire to be operational engineers
• Stubborn
• Can steer their inner ADD
  – Interrupt driven
• Not the same as good developers

Danger signs

• Thinks operations is a path to development engineering
  – Fire them
• Want people dedicated to the task
• A good operations engineer should spend some time in development
• A good development engineer MUST spend some time in operations

Debugging

• 9 Rules of debugging
• http://www.debuggingrules.com/Poster_download.html
  – Yes, the font is horrible

Rule 1: Understand the system

• Complexity kills
• No excuse
• If you write it, you must know it
• If you run it, you must know it
• If you buy it, you must know it

Rule 3: Quit thinking and look

• "It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."

Rule 3: Quit thinking and look

• What do you look at?
• The importance of monitoring
• Monitoring
• Monitoring
• Monitoring

My my, confusing term

• Monitoring
• Alerting
• Trending

Monitoring

• Collects data
• Puts it into databases
• Makes it available for you
• Active collection
• Passive interaction

Alerting

• Acts on monitoring data
• Severe alerts
  – Active
  – Needs action
• Passive alerts
  – Things that need to be done but not right now
• DO NOT OVER ALERT
• DO NOT CRY WOLF
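One way to read the severe/passive split, as a minimal Python sketch: severe alerts page someone, passive ones go onto a to-do list, and a repeat of the same severe alert inside a quiet period is suppressed so the pager does not cry wolf. The names `Alert` and `AlertRouter` and the 15-minute quiet period are illustrative assumptions, not a description of any particular tool or of the Wikia setup.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    key: str       # what is wrong, e.g. "site-down" (hypothetical name)
    severe: bool   # True: needs action now; False: can wait


@dataclass
class AlertRouter:
    quiet_seconds: float = 900.0                    # assumed quiet period
    pages: list = field(default_factory=list)       # alerts that paged someone
    todo: list = field(default_factory=list)        # passive alerts for later
    _last_page: dict = field(default_factory=dict)  # key -> time of last page

    def handle(self, alert: Alert, now: float) -> str:
        """Route one alert; `now` is a timestamp in seconds."""
        if not alert.severe:
            if alert.key not in self.todo:
                self.todo.append(alert.key)
            return "queued"
        last = self._last_page.get(alert.key)
        if last is not None and now - last < self.quiet_seconds:
            return "suppressed"                     # do not over alert
        self._last_page[alert.key] = now
        self.pages.append(alert.key)
        return "paged"
```

Passing `now` in explicitly keeps the sketch easy to test; a real router would read the clock itself.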

Wikia alerting strategy

• When the site is slow
• Or down
• We send emails and do phone calls
• Europe and US West coast
• Looking to hire in East Asia
• No night time

Trending

• Long term
• Capacity planning
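As a rough illustration of what long-term trend data buys you for capacity planning, the Python sketch below fits a straight line to usage samples and estimates how long until a capacity limit is reached. The function name `days_until_full` and the linear-growth assumption are illustrative; real growth is often not linear.

```python
def days_until_full(samples, capacity):
    """Estimate days after the last sample until usage reaches capacity.

    samples: list of (day, usage) pairs, in time order.
    Returns None when the fitted trend is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - samples[-1][0])
```

For example, disk usage of 100, 110 and 120 GB on days 0, 1 and 2 against a 200 GB disk gives a slope of 10 GB per day, so the estimate is 8 days after the last sample.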

Monitor Tools

• Nagios
• Cacti
• MRTG
• Hyperic
• Cricket
• Ganglia

External Monitoring

• Use one; it tells you what your clients see every x minutes
• Keynote
• Gomez
• Websitepulse (cheap, easy, I like them; no annoying sales force)

Nagios

• Alerting
• Hassle
• C CGI??
• Doesn’t scale

Hyperic

• Most exciting open source tool
• Agent based, self-configured
• Baseline alerting

Cricket MRTG Cacti

• Impossible to configure
• You need to write tools to do it
• Especially Cacti
  – Somewhat more pleasant than clawing out your eyes

Ganglia

• We love Ganglia
• Automatically graphs everything you want; it just works
• Large scale clusters
• Multicast
• Zero config
• RRD

http://ganglia.wikimedia.org/

• 270 hosts
• 880 CPUs
• 2 clusters
• 1.2 TB of memory


Custom Ganglia Gmetrics

• Write your own

gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 --value=`echo 'show processlist' | mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6`


• Or Learn Unix
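The shell one-liner above can also be sketched in Python, which makes the parsing step easier to test. This is a sketch under assumptions: it reads the tab-separated batch output of `show processlist` (columns Id, User, Host, db, Command, Time, ...) and, unlike the one-liner, which takes the first remaining row, it reports the largest Time among non-sleeping, non-replication threads. The function names and the `root`/`pass` credentials copied from the slide are placeholders.

```python
import subprocess


def oldest_query_seconds(processlist_tsv: str) -> int:
    """Largest Time value among active threads, 0 if there are none."""
    lines = processlist_tsv.strip().splitlines()
    oldest = 0
    for line in lines[1:]:                  # skip the header row
        cols = line.split("\t")
        if len(cols) < 6:
            continue
        user, command, seconds = cols[1], cols[4], cols[5]
        if command == "Sleep" or user == "system user":
            continue                        # idle or replication thread
        if seconds.isdigit():
            oldest = max(oldest, int(seconds))
    return oldest


def gmetric_argv(value: int) -> list:
    """Build the gmetric command line used on the slide."""
    return ["gmetric", "--name=Oldest query", "--type=int32",
            "--units=sec", "--dmax=65", f"--value={value}"]


def report() -> None:
    """Query MySQL and publish the metric (needs mysql and gmetric on PATH)."""
    out = subprocess.run(
        ["mysql", "-uroot", "-ppass", "--batch", "-e", "show processlist"],
        capture_output=True, text=True, check=True,
    ).stdout
    subprocess.run(gmetric_argv(oldest_query_seconds(out)), check=True)
```

Run `report()` from cron about once a minute so that the 65-second `--dmax` never expires between samples.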




Something is wrong

• Don’t worry, data warehouse


tcpdump / Wireshark

• If you suspect the network
• Don’t just suspect
• LOOK AT IT
• tcpdump / Wireshark will tell you
  – If your packets are lost, delayed or corrupted
  – If your windowing is wrong

Rule 4: Divide and Conquer

• Look at the problems in turn
• Split between people
• Go in the order you suspect is the most likely

Rule 5: Change one thing at a time

• I cannot stress this enough
• IF YOU DO NOT THEN YOU HAVE FAILED TO IDENTIFY THE PROBLEM

Rule 6: Keep an audit trail

• You might be making things worse
• Good for the root cause analysis
• Have your shell log all commands
  – Good practice anyway
• Version control

Rule 9: If you didn’t fix it, it ain’t fixed

• You must do something to fix a problem
• Or it will bite you again
• And again
• And again
• They don’t just appear and disappear
• Except BGP route convergence :)

Process

• You need a little
• Don’t worry

Don’t forget

Complexity kills

• Design against it
• Reuse components
• Define standards
• Have a few images that all machines look like; reimage machines every now and then for the heck of it
  – EC2 forces you to do this

MTBF
Mean Time Between Failures

• Actually mostly irrelevant
• Dealing with failure is more important
• Target the right uptime
  – Complexity scales exponentially with required uptime
• Don’t kid yourself, you don’t need 5 nines

MTTR
Mean Time To Recovery

• Important
• No one cares if you fail once a minute
  – If you recover in 50 ms
• If you are down 1 minute a week, you are still going to hit 4 nines (99.99%)
• Failures happen, plan how to deal with them
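The arithmetic behind the two MTTR figures can be checked with a few lines of Python. The helper name `availability` is illustrative; it simply computes the fraction of a period that was not downtime.

```python
def availability(downtime_seconds: float, period_seconds: float) -> float:
    """Fraction of the period during which the site was up."""
    return 1.0 - downtime_seconds / period_seconds


WEEK = 7 * 24 * 3600  # 604800 seconds

# One minute of downtime per week: just over four nines.
one_minute_a_week = availability(60, WEEK)

# Failing once a minute but recovering in 50 ms: about 99.92% by this
# measure, yet each outage is too short for most users to notice.
fail_every_minute = availability(0.050, 60)

print(f"{one_minute_a_week:.6%}")
print(f"{fail_every_minute:.4%}")
```

One minute in a 10080-minute week is roughly 0.0099% downtime, which is where the 99.99% figure comes from.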

Problem found

• If it is critical, start a phone conversation
• Use IRC to communicate technical data
• One person liaises with non-technical staff
• One person specifically in command
• Sleep scheduling (audit log important)

Post crisis

• Root cause analysis
  – Just find out what went wrong
  – And how to avoid it
  – Or fix it faster next time if you can’t
• Keep track of your uptime

Automation

• All machines are created equal
• Seriously
• If you manually make changes
• You are wrong
  – Unless you know what you are doing

Best practices

• Version control
• Gold images
• Centralised authentication
• Time sync (NTP)
• Central logging
• (All of this applies to virtual machines too!)

cfengine

• Standard automation tool
• Written in C
• Not much support
• Very good
• Very annoying

control:

   site           = ( mysite )
   domain         = ( mysite.country )
   sysadm         = ( mark )
   netmask        = ( 255.255.255.0 )
   actionsequence = ( mountall mountinfo addmounts mountall links )
   mountpattern   = ( /$(site)/$(host) )
   homepattern    = ( u? )

Puppet

• New hip kid on the block
• Written in Ruby
• Better support?
• Much nicer syntax
• Easier to extend

define yumrepo (enabled = true) {
    configfile { "/etc/yum.repos.d/$name.repo":
        mode   => 644,
        source => "/yum/repos/$name.repo",
        ensure => $enabled ? {
            true    => file,
            default => absent
        }
    }
}

cobbler

• Automatic PXE installer
  – Uses kickstart files
• Red Hat Enterprise
• CentOS
• Fedora
• Some support for Debian

cobbler

cobbler system add --name=xen8 --mac=00:19:B9:EE:6D:0A --ip=10.10.30.208 \
    --profile=Centos-5-x86_64 \
    --kopts='ksdevice=00:19:B9:EE:6D:0A console=ttyS1,57600 console=tty0'


koan

• Client install tool
  – Xen
  – Or OS re-image

koan --server=10.10.30.205 --virt --profile=virt_fc6 --virt-name=otrs

Your datacenter

• Keep it tidy
  – Label things, keep cables as short as possible
  – Have a switch in each rack
• If you are small without dedicated DC staff you need
  – Remote control power switches
  – Remote console!

Virtualization

• Please use it
• Managing becomes much easier
• Power consumption
• Need a new test box?
  – The requestor can have it in minutes

Power consumption

• Maybe not as important in Europe
• 8 core machines are more efficient than 1 core
• But memcache uses 1 core and all RAM
• Get more RAM and virtualise

Our network admin boxes

• 1 Xen CPU for Vyatta
• 1 Xen CPU for LVS
• 1 Xen CPU for Squid - Carp
• 1 Xen CPU for Squid
• 1 Xen CPU for Monitoring
• 1 Xen CPU for network tasks

• We can have more of these and a loss of one affects us less

Vyatta

• Open source router
  – Really like it
  – No need to use Cisco

LVS

• Linux Virtual Server
• Low level load balancer
• HA
• Fast
• Doesn’t inspire people to put things in the only place that is hard to scale

Squid Carp

• Squids configured to hash the URLs and send each one to a specific backend
• Very little configuration done
• Logging over UDP, so no disk IO
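The idea of hashing URLs so that each one always lands on the same backend cache can be sketched in Python with highest-score (rendezvous) hashing, which is the general approach CARP takes. This is an illustration only: it uses SHA-1 rather than Squid's actual CARP hash function, ignores load factors, and the backend names are made up.

```python
import hashlib


def pick_backend(url: str, backends: list) -> str:
    """Return the backend with the highest combined hash score for this URL."""
    if not backends:
        raise ValueError("no backends configured")

    def score(backend: str) -> int:
        # Combine the backend name and the URL, then hash the pair.
        digest = hashlib.sha1(f"{backend}|{url}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)


backends = ["cache1", "cache2", "cache3"]      # hypothetical backend names
print(pick_backend("/wiki/Main_Page", backends))
```

The useful property is that when one backend is lost, only the URLs that were mapped to it move elsewhere; every other URL keeps hitting the cache that already holds it.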

Squid

• As a reverse web accelerator
• 90% of our hits served from RAM in less than 1 ms
• Same as Wikipedia
• We only use RAM cache (unlike Wikipedia)
• Cached per user
• If not cacheable, cache for a second to reduce backend effect
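Why caching an "uncacheable" page for one second helps can be shown with a small Python sketch: however many requests arrive, the backend sees at most one per URL per second. The class name `OneSecondCache` is hypothetical, and this is not how Squid is implemented, only the effect of a one-second TTL.

```python
class OneSecondCache:
    """Tiny TTL cache in front of a backend fetch function."""

    def __init__(self, fetch, ttl=1.0):
        self._fetch = fetch      # callable: url -> response body
        self._ttl = ttl          # seconds a response stays fresh
        self._entries = {}       # url -> (expires_at, body)

    def get(self, url, now):
        """Return the cached body if still fresh, otherwise ask the backend."""
        entry = self._entries.get(url)
        if entry is not None and now < entry[0]:
            return entry[1]
        body = self._fetch(url)
        self._entries[url] = (now + self._ttl, body)
        return body
```

With 1000 requests per second for one hot page, the backend load for that page drops from 1000 requests per second to 1, at the cost of content that is at most a second stale.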

App servers

• 1 Xen CPU for memcache (5 GB RAM)
• 1 Xen CPU for Squid (5 GB RAM)
• 6 Xen CPUs for Apache (6 GB RAM)

• More power efficient, less affected by loss

• Applications can’t affect each other

Databases

• Keep developers on a short leash
• Report bad queries
• Fear object relational mappers

Outsourcing

• As much as possible
• The younger you are as a company the less risk
  – When you have no users, you have no value
• VCs don’t like having their money go into Capex

What I want from Vendors

• They do what they tell me
• They do what I tell them
• No annoying up sells, no premium services
  – I know more about what you are selling than you

Services we use

• Amazon EC2 and S3
• Panther Express

Panther Express

• Fantastic Content Distribution Network
• Cheap, simple price list
  – Take note, Akamai
• Cut delivery time to Europe by 70%
• We let our images be cached 1 second to reduce load

EC2 and S3

• We save all our binlogs to S3
• We save database dumps to S3
• We have monitors running from EC2
• We plan to build a data warehouse cluster on EC2

EC2 Requires Automation

• Machine is blank when you bring it up
• Download database dump from S3 and replicate up, automatically
• Use Puppet
• Amazon saves you hardware headaches
  – But complexity is still a problem

Thank you