CloudAustin Black Friday 2013

DESCRIPTION

A 2014 CloudAustin presentation on how we prepared for and executed on our high traffic surge over Black Friday.

Black Friday 2013

Ernest Mueller, Bazaarvoice Engineering

What Is Black Friday?

• The National Retail Federation writes: For some retailers, the holiday season [Nov-Dec] can represent as much as 20-40% of annual sales.

• ShopperTrak says: National retail sales increased 2.7% and foot traffic decreased 14.6% when compared to the same two months last year (2012).

• Black Friday (the Friday after Thanksgiving) and Cyber Monday (the Monday after that) have become big discounting and promotional events that retailers use to push holiday purchasing.

• Summary: It’s a big deal to many of our clients and is becoming more ecomm-driven every year

Historically

In 2011 we served 1.52 B.

And in 2012 we served 2.03 B.

Roadmap Prediction

Bazaarvoice expected 2.6 B review impressions on Black Friday & Cyber Monday 2013. That’s a 30% YoY growth rate.

Results

Bazaarvoice served 2.67 B review impressions on Black Friday & Cyber Monday 2013. That’s a 31.4% YoY growth rate.

Black Friday/Cyber Monday 2013 @BV
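The growth figures above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes the 2011 and 2012 totals (1.52 B and 2.03 B) count the same kind of review impressions as the 2013 numbers; the function names are illustrative, and because the inputs are rounded slide figures the results only match to about two significant digits.

```python
def project(base: float, growth_rate: float) -> float:
    """Apply a year-over-year growth rate to a base value."""
    return base * (1.0 + growth_rate)


def yoy_growth(previous: float, current: float) -> float:
    """Year-over-year growth as a fraction (0.30 means 30%)."""
    return current / previous - 1.0


# Figures in billions, taken from the slides (rounded).
served_2011 = 1.52
served_2012 = 2.03

print(f"2011 -> 2012 growth:      {yoy_growth(served_2011, served_2012):.1%}")
print(f"2012 plus 30% (predicted): {project(served_2012, 0.30):.2f} B")
print(f"2012 plus 31.4% (actual):  {project(served_2012, 0.314):.2f} B")
```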

If you took all the reviews we served up to shoppers on Black Friday 2013 and printed them into paperback book form, it would take a bookshelf almost 11 miles long to hold them.

Step 0: Architecture

Scaling Isn’t Just For Black Friday

• We continuously work to scale the product – our data size doubles year over year

• Architectural changes to meet the demand are constant and ongoing – there is no “maintenance mode” at scale

• Your base architecture needs to be scalable

• Then you have to refactor again and again

The Three Amigos

Dove’s Thoughts

• Upping performance and running your system at 40% instead of 80% gave a lot of insight into our second-order set of bottlenecks and performance characteristics

• Where to place or span ASGs and other Amazon bits was a major talking point among the Amigos; we ended up placing them per AZ because of our DNS/HAProxy front end

• The “diagonal scaling” challenge of instance size vs number of instances vs PIOPS speed is hard and you basically just have to run tests to dial in on the minima; this changes a lot over time

• Remember, with the public cloud a lot of this is black box and while that removes a lot of work from you, it adds other work and requires certain best practices to make the most of your system
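The "run tests to dial in on the minima" point can be pictured as a search over load-test results. The sketch below is an assumption about how such a comparison might be recorded, not the team's actual tooling: the `LoadTestResult` fields, the costs, and the latencies are all made-up illustrative values. It picks the cheapest tested configuration that still meets a latency budget.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LoadTestResult:
    instance_type: str
    instance_count: int
    piops: int
    hourly_cost: float      # total cluster cost per hour in USD (made up)
    p95_latency_ms: float   # measured under the target load (made up)


def cheapest_passing(results: Iterable[LoadTestResult],
                     latency_budget_ms: float) -> Optional[LoadTestResult]:
    """Return the lowest-cost result within the latency budget, or None."""
    passing = [r for r in results if r.p95_latency_ms <= latency_budget_ms]
    return min(passing, key=lambda r: r.hourly_cost) if passing else None


# Hypothetical results from one round of tests; rerun as the system changes.
results = [
    LoadTestResult("m2.xlarge", 12, 1000, 6.00, 180.0),
    LoadTestResult("m2.xlarge", 8, 2000, 4.80, 240.0),
    LoadTestResult("m2.2xlarge", 6, 1000, 5.90, 150.0),
    LoadTestResult("m2.2xlarge", 4, 2000, 4.50, 320.0),
]

best = cheapest_passing(results, latency_budget_ms=250.0)
print(best)
```

Because the best point shifts over time, the list of results is the part worth keeping current; the selection itself stays trivial.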

Step 1: Planning

This Year

• We started Black Friday specific work on August 12, 2013.

• That’s when client readiness surveys start coming in!

• We’ve done this in previous years, but this year there was a big additional demand placed on the planning…

The Old Meets The New

Communicate and Coordinate

• The first step is always internal communication

• We create an “Internal Preparedness Statement” to provide a concise, definitive statement for Engineering, Sales, Support, and Implementation

• Regular weekly prep status meetings

• From the August 12 “Planning is beginning” notification till the celebratory happy hour on Dec 16, I have 1,287 emails that mention “Black Friday.”

• Due to the new distributed-team challenge, we needed a person responsible for coordinating our overall Black Friday response…

Step 2: Freezing

BV Holiday Freeze Statement

Soft Freeze

We observe a general change freeze period starting 1 November and ending 15 January. During this period, we do not introduce changes to Bazaarvoice products that are integrated with our clients' websites. We may introduce changes into back-end systems that do not impact the end-user site experience.

Hard Freeze

We only release infrastructure and configuration changes required to restore service to or prevent a service disruption to one or more of our customers. The Critical System Change periods are:

• 5 days prior to and 5 days after Black Friday (24 November 2013 through 4 December 2013)

• 4 days prior to and 7 days after Christmas (21 December 2013 through 1 January 2014)
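A freeze policy like this is easy to encode so that a deploy script can check it. The following is a minimal sketch using the dates from the statement; the function name and the choice to treat both end dates as inclusive are assumptions, not part of the published policy.

```python
from datetime import date

# Dates from the freeze statement; end dates treated as inclusive (assumption).
SOFT_FREEZE = (date(2013, 11, 1), date(2014, 1, 15))
HARD_FREEZES = [
    (date(2013, 11, 24), date(2013, 12, 4)),   # around Black Friday
    (date(2013, 12, 21), date(2014, 1, 1)),    # around Christmas
]


def freeze_level(day: date) -> str:
    """Return "hard", "soft", or "none" for the given calendar day."""
    if any(start <= day <= end for start, end in HARD_FREEZES):
        return "hard"
    if SOFT_FREEZE[0] <= day <= SOFT_FREEZE[1]:
        return "soft"
    return "none"


print(freeze_level(date(2013, 11, 29)))  # Black Friday 2013
print(freeze_level(date(2013, 12, 10)))
print(freeze_level(date(2013, 10, 31)))
```

The hard-freeze check runs first because both hard windows sit inside the soft window.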

What Does Freeze Mean To You?

Step 3: Scaling

Traffic Projections and Scaling Plan

• Sadly, the answer isn’t as simple as “Amazon, yay!”

• Even they run out of resources over this period

• We conduct detailed YoY traffic projections

• We come up with a scaling plan to fit the projections

• Leave headroom!

Traffic Projection Tips

• Your system has various axes of scaling within it – trend and estimate them all

• We estimate incoming and outgoing reviews per day, peak requests per second on display servers, and calculate per-server acceptable capacity at each level (Tomcat, Solr, database)

• Once you’ve done it one year, it’s easier because you can apply proportional lift to current traffic

• Keep an ear to the ground for environmental changes! This year retailers decided to start earlier and spike a little less on BF, so scaling came earlier than last year – but we read the news so we were prepared
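The proportional-lift and headroom ideas above reduce to a small calculation per tier. This is a rough sketch under stated assumptions: the current peak, the lift factor, the per-server capacities, and the 30% headroom are all hypothetical numbers chosen for illustration, not Bazaarvoice's real figures.

```python
import math


def servers_needed(current_peak_rps: float, lift: float,
                   per_server_rps: float, headroom: float = 0.3) -> int:
    """Servers required to carry projected peak traffic with headroom to spare.

    lift is the proportional growth applied to current traffic (1.3 = +30%).
    headroom is the fraction of each server's capacity left unused.
    """
    projected_peak = current_peak_rps * lift
    usable_per_server = per_server_rps * (1.0 - headroom)
    return math.ceil(projected_peak / usable_per_server)


# Hypothetical acceptable capacity per server at each tier, in requests/second.
tier_capacity = {"tomcat": 400, "solr": 250, "database": 2000}

for tier, capacity in tier_capacity.items():
    count = servers_needed(current_peak_rps=10_000, lift=1.3,
                           per_server_rps=capacity)
    print(f"{tier}: {count} servers")
```

Each scaling axis gets its own capacity number, so the same projected peak can produce very different server counts per tier.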

[Chart: Pageviews and UGC Impressions, vertical axis from 0 to 1.6 B; labeled values 1.337 B and 1.330 B]

Step 4: Supporting

Situational Awareness

• When the clock is running, you need your monitoring, alerting, response, etc. to be highly optimized for speed.

• We use a variety of monitoring types – Nagios, Zabbix, Datadog, Keynote, Pingdom

• And PagerDuty of course, aka “The One Ring”

• We write out runbooks for common response tasks such that we can have level 1 support people do them – or at least so that we don’t screw them up!

• Custom tooling is a must.

[Diagram: system traffic overview. Labels, in slide order: 164k RPS; 10 m2.xlarge; 12 m2.xlarge; 10 m2.xlarge; 12k RPS; 21k RPS; CDN (hit rate 80%, TTL 600s); 4330 ms; 8210 ms; AWS East; AWS West; 1023 ms; c1; 3.4k RPS, 2340 ms; System Stats Histogram; 3.4k RPS, 1240 ms; c2]
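One possible reading of the diagram's labels, which the slide itself does not spell out, is that 164k RPS is the traffic arriving at the CDN and the 12k and 21k RPS figures are the cache misses reaching the two AWS regions. Under that assumption, the 80% hit rate can be checked with a short calculation; the function name is illustrative.

```python
def origin_rps(edge_rps: float, hit_rate: float) -> float:
    """Requests per second that miss the CDN cache and reach the origin."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return edge_rps * (1.0 - hit_rate)


edge = 164_000      # "164k RPS" label, read as edge traffic (assumption)
hit_rate = 0.80     # "Hit Rate 80%" label
misses = origin_rps(edge, hit_rate)

# "12k RPS" + "21k RPS", read as per-region origin traffic (assumption).
regional_total = 12_000 + 21_000

print(f"expected origin traffic: {misses:,.0f} RPS")
print(f"sum of regional labels:  {regional_total:,} RPS")
```

The two numbers land close together, which is consistent with this reading but does not confirm it.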

Escalated Response

• We had 3x daily (9 AM, 2 PM, 9 PM) status calls for all teams to check in

• We sent an overall system status and performance report to the entire company daily

• On-call shifts of 12 hours apiece – not fully online, but not just “waiting for pages” either; whoever is on call needs to be eyeballing the system at regular intervals

Step 5: Practicing

Test Your Plan!

• Test your scaling

– Amazon limits are your enemy – there’s a thousand of ’em and many are hidden

• Test your monitoring

• Test your paging

• Test your runbooks

• We had two “game days” to scale up, apply load, provoke issues and execute on remediation
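Part of testing the scaling plan is comparing it against account limits before a game day finds them for you. The sketch below is only an illustration: the resource names, planned amounts, and limits are hypothetical, and real limits would have to be gathered from AWS rather than typed into a dictionary.

```python
from typing import Dict, Optional, Tuple


def limit_violations(plan: Dict[str, int],
                     limits: Dict[str, int]) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {resource: (planned, limit)} for anything over or missing a limit.

    A limit of None means no limit is on record, which is worth chasing
    down before the event ("many are hidden").
    """
    problems: Dict[str, Tuple[int, Optional[int]]] = {}
    for resource, planned in plan.items():
        limit = limits.get(resource)
        if limit is None or planned > limit:
            problems[resource] = (planned, limit)
    return problems


# Hypothetical scaling plan and known account limits.
plan = {"m2.xlarge instances": 32, "provisioned IOPS": 40_000, "load balancers": 6}
limits = {"m2.xlarge instances": 20, "provisioned IOPS": 40_000}

for resource, (planned, limit) in limit_violations(plan, limits).items():
    print(f"{resource}: planned {planned}, limit {limit}")
```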

Step 6: Profit

How It Went Down

• 23 teams across R&D and Support

• 40 engineers participating as Black Friday representatives

• 11 weeks of planning

• 2 stress-testing “Game Days”

• 26 round-the-clock status calls (8 “yellow” status, 18 “green”)

• 35 issues examined during the period

• $136,620.27 for the week in hosting costs

• Zero downtime

November Performance (c3)

Questions?

Recruiting Moment - BV:IO 2014

• Bazaarvoice’s internal tech conference and hackathon!

• Last year: Alamo Drafthouse, Adrian Cockcroft (Netflix), Jason Baldridge (UT), Nick Bailey (DataStax), Peter Wang (Continuum Analytics)

• This year: Norris Conference Center, Theo Schlossnagle (Circonus), Greg Brockman (Stripe CTF), Bob Metcalfe (UT)

• Late-nighter hackathon to develop sweet social commerce solutions

• Plus – COD: Black Ops!

Register: bvio2014.eventbrite.com

Team Signups On Hacker League

Koderz Only