Automated Push Monitoring and Rollback @IMVU
ArchCamp Lightning talks
Kishore Jalleda, Director of Operations
IMVU, Inc
How did it all start?
• From a P1 back in 2007.
• Site issues
• Ops and eng identify the bad revision
• Engineers commit a fix
• Wait for BB to go green
• Finally all tests pass
• Push the fix and the site recovers
• Too bad we were down for 20 minutes
How did it all start? (cont’d)
• Postmortem time (5 whys)
• Run multiple website revisions for fast rollbacks
Typical web server directory structure
website -> /home/webadmin/website.107847 (symlink)
website.107767
website.107788
website.107825
website.107834
website.107835
website.107847
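A minimal sketch of how a symlink switch between revision directories could look, assuming a POSIX filesystem. The function name `switch_revision` and the temporary link suffix are hypothetical, not IMVU's actual tooling. The new link is created under a temporary name and renamed over the old one, so the `website` link never disappears mid-switch.

```python
import os


def switch_revision(link_path, target_dir):
    """Repoint link_path (e.g. 'website') at target_dir (e.g. 'website.107847').

    Hypothetical helper: builds the new symlink under a temporary name,
    then renames it over the existing link. On POSIX the rename replaces
    the old link in a single step.
    """
    tmp_link = link_path + ".tmp"
    # Clean up a leftover temporary link from an interrupted earlier run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target_dir, tmp_link)
    os.replace(tmp_link, link_path)
```

Rolling back is then the same call with the previous green revision's directory as the target.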
More evolution
• We had more P1s, more postmortems, and more follow-ups.
• Identified some common root causes
– Finding changes in key metrics was manual and sometimes took days or even weeks.
– Rolling back was fully scripted but required a manual trigger
• Push monitoring and auto rollbacks were born
Push Monitoring & Auto Rollback
Phase 1:
1. Push to small % of servers
2. Monitor pre and post push key metrics
   1. Key metrics OK?
      1. Go to Phase 2
   2. Key metrics not OK?
      1. Roll back to previous green revision (simple symlink switch, takes seconds)
Push Monitoring & Auto Rollback (cont’d)
Phase 2:
1. Push to rest of servers
2. Monitor pre & post push key metrics
   1. Key metrics OK?
      1. Push successful
   2. Key metrics not OK?
      1. Roll back to previous green revision (simple symlink switch, takes seconds)
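The two phases can be sketched roughly as below. This is an illustration under stated assumptions, not the actual IMVU implementation: `deploy` and `read_metrics` are hypothetical callbacks, every metric is assumed to be "higher is worse" (for example an error rate), and the 10% tolerance is an arbitrary placeholder.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]


def metrics_ok(before: Metrics, after: Metrics, max_increase: float = 0.10) -> bool:
    """True if no metric rose by more than max_increase (relative) after the push.

    Assumes every metric is 'higher is worse'; the threshold is a placeholder.
    """
    return all(after[name] <= old * (1 + max_increase) for name, old in before.items())


def push_with_rollback(
    new_rev: str,
    good_rev: str,
    servers: List[str],
    canary_fraction: float,
    deploy: Callable[[str, str], None],   # hypothetical: deploy(server, revision)
    read_metrics: Callable[[], Metrics],  # hypothetical: snapshot of key metrics
) -> bool:
    """Push to a small share of servers, then the rest; roll back on bad metrics."""
    canary_count = max(1, int(len(servers) * canary_fraction))
    phases = [servers[:canary_count], servers[canary_count:]]
    pushed: List[str] = []
    for phase_servers in phases:
        if not phase_servers:
            continue
        before = read_metrics()
        for server in phase_servers:
            deploy(server, new_rev)
        pushed.extend(phase_servers)
        after = read_metrics()
        if not metrics_ok(before, after):
            # Return every server touched so far to the previous green revision.
            for server in pushed:
                deploy(server, good_rev)
            return False
    return True
```

A real system would wait for metrics to settle between deploying and sampling; that delay is left out here to keep the sketch short.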
What if your push gets rolled back?
You get an email with subject “rollback of r107767”
The body contains something like this
Revision 107767 triggered an alarm in the cluster and was automatically rolled back to revision 107764
Details: https://foo.imvu.com/push_yyyy.php?push_phase_id=384000
kjalleda initiated the push at Fri May 13 14:46:38 2011.
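A notification like the one above could be assembled with Python's standard `email` package. This is only a sketch: the function name, its parameters, and the addresses in the usage note are hypothetical, and the wording is copied from the sample message on this slide.

```python
from email.message import EmailMessage
from typing import List


def build_rollback_email(
    bad_rev: str,
    good_rev: str,
    details_url: str,
    pusher: str,
    pushed_at: str,
    sender: str,
    recipients: List[str],
) -> EmailMessage:
    """Build (but do not send) a rollback notification in the format shown above."""
    msg = EmailMessage()
    msg["Subject"] = f"rollback of r{bad_rev}"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(
        f"Revision {bad_rev} triggered an alarm in the cluster and was "
        f"automatically rolled back to revision {good_rev}\n\n"
        f"Details: {details_url}\n\n"
        f"{pusher} initiated the push at {pushed_at}.\n"
    )
    return msg
```

The returned message can be handed to `smtplib.SMTP.send_message`; for example, `build_rollback_email("107767", "107764", url, "kjalleda", "Fri May 13 14:46:38 2011", "push@example.com", ["eng@example.com"])`, where `url` and both addresses are placeholders.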
More evolution
These evolved from more postmortems / 5 Whys
• Regret your last push? “imvu_oops” to the rescue. Along with rolling back to a previous good revision, it also locks commits and pushes and sends an email to ops, eng, and on-call.
• Ability to manually roll back quickly without having to go through commit/BB/push
• Ability to manually push a particular revision
• Ability to manually lock commits and/or pushes
• Automated rollbacks on any metric inaccessibility
• Immune system for IMVU config variables / site switches
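One way to read "automated rollbacks on any metric inaccessibility" is that a metric that cannot be read counts as a failed check, the same as a metric that is out of bounds. A minimal sketch of that rule follows; the function name, the fetcher callbacks, and the per-metric upper limits are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional


def rollback_reason(
    fetchers: Dict[str, Callable[[], float]],  # hypothetical: one reader per metric
    limits: Dict[str, float],                  # hypothetical: upper bound per metric
) -> Optional[str]:
    """Return why a push should be rolled back, or None if all metrics look fine.

    A metric that raises while being read is treated as a failure, so the
    push is never declared healthy on missing data.
    """
    for name, fetch in fetchers.items():
        try:
            value = fetch()
        except Exception as exc:
            return f"metric {name} inaccessible: {exc}"
        if value > limits[name]:
            return f"metric {name} = {value} exceeds limit {limits[name]}"
    return None
```

Failing closed like this trades some extra rollbacks (for example when the metrics store itself is down) for the guarantee that a push is only kept when its health was actually observed.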
Expect some hurdles
• Don’t expect your push monitoring to catch everything. Not all changes cause immediate impact; some take days or even weeks to surface.
• There will inevitably be false positives and intermittent issues, for a variety of reasons.
• Push settings/thresholds may need periodic tweaking to accommodate some cluster changes
• Ongoing production issues can skew some metrics which can impact pushes
• Rollbacks from unrelated errors are a pain to deal with.
Thank You!
Kishore Jalleda
IMVU recognized as:
Inc. 500: http://bit.ly/dv52wK
Red Herring 100: http://bit.ly/bbz5Ex
Best Place to Work: http://bit.ly/aAVdp8
(and we're hiring): http://www.imvu.com/jobs