Automated Push Monitoring and Rollback @IMVU ArchCamp Lightning talks Kishore Jalleda Director of Operations IMVU, Inc


Automated Push Monitoring and Rollback @IMVU

ArchCamp Lightning talks

Kishore Jalleda, Director of Operations

IMVU, Inc

How did it all start?

• From a P1 back in 2007.
• Site issues
• Ops and eng identify the bad revision
• Engineers commit a fix
• Wait for BB to go green
• Finally all tests pass
• Push the fix and the site recovers
• Too bad we were down for 20 minutes

How did it all start? (cont’d)

• Postmortem time (5 whys)
• Run multiple website revisions for fast rollbacks

Typical web server directory structure

website -> /home/webadmin/website.107847 (symlink)
website.107767
website.107788
website.107825
website.107834
website.107835
website.107847
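A layout like this makes a rollback a matter of repointing one symlink. The sketch below is one way to do that switch atomically on a POSIX system. It is not IMVU's actual script: the `switch_revision` helper and its parameters are hypothetical, and only the `website.<revision>` naming comes from the slide.

```python
import os

def switch_revision(root: str, revision: int, link_name: str = "website") -> str:
    """Point root/link_name at root/website.<revision>; return the new target."""
    target = os.path.join(root, f"website.{revision}")
    if not os.path.isdir(target):
        raise FileNotFoundError(f"revision directory not deployed: {target}")

    link = os.path.join(root, link_name)
    tmp_link = f"{link}.tmp.{os.getpid()}"

    # Clean up a leftover temporary link from an interrupted earlier run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)

    # Create the new symlink under a temporary name, then rename it over the
    # live link. The rename replaces the link itself in one atomic step, so
    # there is no moment at which "website" is missing.
    os.symlink(target, tmp_link)
    os.replace(tmp_link, link)
    return target
```

Calling `switch_revision("/home/webadmin", 107835)` would roll the example above back from 107847 to 107835, provided that revision directory is still on disk.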

More evolution

• We had more P1s, more postmortems, and more follow-ups.

• Identified some common root causes
– Finding changes in key metrics was manual and sometimes took days or even weeks.
– Rolling back was fully scripted but required a manual trigger.

• Push monitoring and auto rollbacks were born

Push Monitoring & Auto Rollback

Phase 1:
1. Push to small % of servers
2. Monitor pre and post push key metrics
   1. Key metrics OK? Go to Phase 2
   2. Key metrics not OK? Roll back to previous green revision (simple symlink switch, takes seconds)
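The slides do not say how "key metrics OK?" is decided. As a minimal sketch, assume each key metric is sampled before and after the push and the post-push mean may exceed the pre-push mean by at most a per-metric tolerance. The metric names, the tolerances, and the `metrics_ok` helper are all hypothetical.

```python
from statistics import mean

# Hypothetical tolerances: maximum allowed relative increase per metric.
THRESHOLDS = {"error_rate": 0.10, "cpu_load": 0.25}

def metrics_ok(pre, post, thresholds=THRESHOLDS):
    """Compare post-push samples with pre-push samples.

    pre and post map a metric name to a list of samples.
    Returns (ok, offenders), where offenders lists the metrics that regressed.
    """
    offenders = []
    for name, max_increase in thresholds.items():
        baseline = mean(pre[name])
        current = mean(post[name])
        # A zero baseline allows no increase at all under this rule.
        allowed = baseline * (1 + max_increase)
        if current > allowed:
            offenders.append(name)
    return (not offenders, offenders)
```

A real check would probably also need lower bounds for metrics where a drop is the bad sign (logins, revenue), which this sketch leaves out.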

Push Monitoring & Auto Rollback (cont’d)

Phase 2:
1. Push to rest of servers
2. Monitor pre & post push key metrics
   1. Key metrics OK? Push successful
   2. Key metrics not OK? Roll back to previous green revision (simple symlink switch, takes seconds)
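Taken together, the two phases amount to a short control loop. The sketch below is an illustration under stated assumptions, not IMVU's tooling: the `deploy` and `healthy` callbacks, the 10% first-phase fraction, and the returned status strings are all made up for the example.

```python
def staged_push(new_rev, prev_green, servers, deploy, healthy, canary_fraction=0.1):
    """Push new_rev in two phases, rolling back to prev_green on an alarm.

    deploy(server, revision) switches one server to a revision (hypothetical).
    healthy(pushed_servers) returns True if key metrics look OK (hypothetical).
    """
    canary_count = max(1, int(len(servers) * canary_fraction))
    phases = [servers[:canary_count], servers[canary_count:]]

    pushed = []
    for phase_number, group in enumerate(phases, start=1):
        for server in group:
            deploy(server, new_rev)
            pushed.append(server)
        if not healthy(pushed):
            # Roll back every server touched so far, not just this phase.
            for server in pushed:
                deploy(server, prev_green)
            return f"rolled back in phase {phase_number}"
    return "push successful"
```

The point of the split is that a bad revision caught in phase 1 only ever reaches the small first group of servers.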

What if your push gets rolled back?

You get an email with the subject “rollback of r107767”. The body contains something like this:

Revision 107767 triggered an alarm in the cluster and was automatically rolled back to revision 107764

Details: https://foo.imvu.com/push_yyyy.php?push_phase_id=384000

kjalleda initiated the push at Fri May 13 14:46:38 2011.
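A notification in this shape can be assembled with Python's standard `email` package. This is only a sketch: the `rollback_email` helper, the sender, and the recipient addresses are hypothetical, and actually sending the message (for example with `smtplib`) is left out.

```python
from email.message import EmailMessage

def rollback_email(bad_rev, good_rev, pusher, pushed_at, details_url,
                   sender, recipients):
    """Build (but do not send) a rollback notification like the one above."""
    msg = EmailMessage()
    msg["Subject"] = f"rollback of r{bad_rev}"
    msg["From"] = sender            # hypothetical sender address
    msg["To"] = ", ".join(recipients)
    msg.set_content(
        f"Revision {bad_rev} triggered an alarm in the cluster and was "
        f"automatically rolled back to revision {good_rev}\n\n"
        f"Details: {details_url}\n\n"
        f"{pusher} initiated the push at {pushed_at}.\n"
    )
    return msg
```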

Push Status Page

More evolution

These evolved from more Postmortems / 5 Whys:
• Regret your last push? “imvu_oops” to the rescue. Along with rolling back to a previous good revision, it also locks commits and pushes and sends an email to ops, eng, and on-call.
• Ability to manually roll back quickly without having to go through commit/BB/push
• Ability to manually push a particular revision
• Ability to manually lock commits and/or pushes
• Automated rollbacks on any metric inaccessibility
• Immune system for IMVU config variables / site switches
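"Automated rollbacks on any metric inaccessibility" means the monitor fails closed: a metric that cannot be read counts as an alarm and is not ignored. A minimal sketch of that rule follows, with hypothetical `fetch` and `is_ok` callbacks standing in for whatever the real metrics system provides.

```python
def evaluate_metrics(metric_names, fetch, is_ok):
    """Return a list of alarm reasons; an empty list means no rollback is needed.

    fetch(name) returns the current value, returns None, or raises if the
    metric cannot be read (hypothetical). is_ok(name, value) applies the
    threshold for that metric (hypothetical).
    """
    alarms = []
    for name in metric_names:
        try:
            value = fetch(name)
        except Exception as exc:
            # An unreachable metric is treated as an alarm, not skipped.
            alarms.append(f"{name}: inaccessible ({exc})")
            continue
        if value is None:
            alarms.append(f"{name}: inaccessible (no data)")
        elif not is_ok(name, value):
            alarms.append(f"{name}: out of range ({value})")
    return alarms
```

The trade-off is more rollbacks when the metrics pipeline itself is flaky, which ties into the false positives discussed under the hurdles.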

Expect some hurdles

• Don’t expect your push monitoring to catch everything. Not all changes cause immediate impact; some take days or even weeks to surface.

• There will inevitably be false positives and intermittent issues, for a variety of reasons.

• Push settings/thresholds may need periodic tweaking to accommodate some cluster changes

• Ongoing production issues can skew some metrics which can impact pushes

• Rollbacks triggered by unrelated errors are a pain to deal with.

Thank You!

Kishore Jalleda

IMVU recognized as:

Inc. 500: http://bit.ly/dv52wK

Red Herring 100: http://bit.ly/bbz5Ex

Best Place to Work: http://bit.ly/aAVdp8

(and we're hiring): http://www.imvu.com/jobs