Preparing for CDN Failure: Why and How

Aaron Peters (TurboBytes) and Kyle Young (Mobify)

@aaronpeters & @ksgyoung

Examples of CDN Failure

https://www.flickr.com/photos/phobia/2308371224/

Multi-continent outage for 17 minutes

100% broken in North America and Europe.

Up and down for days, some ASNs only

Zoomed in on May 4 and May 5 …

OMG!

Japan, all networks: 5 hours of bad

Three days later, it happened again …

#1 eyeballs network in Germany

Single country, single ASN perf degradations are not uncommon in CDN land …

“System Maintenance”

Take-Aways

Expect your CDN to fail on you

They won’t tell you about it

Monitoring CDN Performance

https://www.flickr.com/photos/chasblackman/8502151556/

Synthetic monitoring

Pingdom, cron+curl, ThousandEyes, Interns, etc.

The Pros

Very Specific

Known Intervals

Controlled Locations

Lots of Information

The Cons

Limited Locations

Datacenters

Hyper-idealized Client Model

Still better than nothing.

RUM for Page Load Time

CDN must serve HTML + resources

Third party content adds noise

Can’t capture CDN failing

Not very useful

RUM with Test Object

Fetch small object from CDN(s) after onload

Beacon timings, or a Fail

Use Nav Timing for best insight

Challenge: get your JS on other sites too, to capture the CDN failing
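The test-object approach can be sketched roughly as below. This is a minimal illustration, not the speakers' implementation: `TEST_OBJECT_URL`, `BEACON_URL`, the 5-second timeout and the payload shape are all assumptions, and the sketch assumes the CDN serves the test object with an `Access-Control-Allow-Origin` header so that a cross-origin `fetch` succeeds.

```javascript
// Hypothetical endpoints: replace with your own test object and beacon collector.
const TEST_OBJECT_URL = "https://cdn1.example.com/rum-test.gif";
const BEACON_URL = "https://rum.example.com/beacon";
const TIMEOUT_MS = 5000; // assumed cut-off after which the fetch counts as a Fail

// Turn one measurement into a beacon payload: timings on success, a Fail otherwise.
function buildBeacon(cdn, startMs, endMs, ok) {
  return ok
    ? { cdn, result: "ok", durationMs: Math.round(endMs - startMs) }
    : { cdn, result: "fail" };
}

// Fetch the small object, bypassing the browser cache, and resolve to a payload.
async function probe(cdn, url) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), TIMEOUT_MS);
  const start = performance.now();
  try {
    const res = await fetch(url, { cache: "no-store", signal: controller.signal });
    return buildBeacon(cdn, start, performance.now(), res.ok);
  } catch (err) {
    // Network error, CORS rejection or timeout: report a Fail.
    return buildBeacon(cdn, start, performance.now(), false);
  } finally {
    clearTimeout(timer);
  }
}

// Browser only: run after onload so the probe does not compete with page resources.
if (typeof window !== "undefined") {
  window.addEventListener("load", async () => {
    const payload = await probe("cdn1", TEST_OBJECT_URL);
    navigator.sendBeacon(BEACON_URL, JSON.stringify(payload));
  });
}
```

Beaconing the Fail case explicitly is what makes a Fail Ratio measurable, not just speed.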

RUM for Page Resources

Resource Timing API

Send assets with TAO header

Onload => beacon timings
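A rough sketch of the idea: at onload, read the Resource Timing entries for assets on the CDN hostname and beacon a summary. The beacon URL and payload fields are assumptions for illustration; `cdn1.mydomain.com` is the example hostname used later in the deck. Detailed timings for cross-origin assets are only exposed when the CDN sends a `Timing-Allow-Origin` (TAO) response header.

```javascript
const CDN_HOST = "cdn1.mydomain.com"; // example hostname; adjust to your own
const BEACON_URL = "https://rum.example.com/beacon"; // hypothetical collector

// Reduce a PerformanceResourceTiming-like entry to the fields worth beaconing.
function summarize(entry) {
  return {
    url: entry.name,
    dnsMs: entry.domainLookupEnd - entry.domainLookupStart,
    connectMs: entry.connectEnd - entry.connectStart,
    ttfbMs: entry.responseStart - entry.requestStart,
    totalMs: entry.duration,
  };
}

// Keep only entries whose URL is on the CDN hostname.
function cdnTimings(entries, host) {
  return entries
    .filter((e) => new URL(e.name).hostname === host)
    .map(summarize);
}

// Browser only: collect and beacon at onload.
if (typeof window !== "undefined") {
  window.addEventListener("load", () => {
    const timings = cdnTimings(performance.getEntriesByType("resource"), CDN_HOST);
    navigator.sendBeacon(BEACON_URL, JSON.stringify(timings));
  });
}
```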

Easy, right? Not so fast …

Starting Points

Did the asset come from CDN?

Did it load fast enough?

Was it a good response (200/304)?

Fetched from network?

if (total time > 20 ms) { // from network }

DOES NOT WORK!

The Resource Timing (RT) API has many quirks

DNS time and Connect time always zero in IE

No data for 4xx/5xx responses, except in IE

Nothing in RT API until asset fully loaded, except in IE

Firefox (FF) doesn’t tell the truth

How to measure CDN perf with the RT API

main.css

inline JS, high in HEAD, exec only if window.chrome

setTimeout(checkRTAPI, 5000);

at onload or when timer ended:

if ( main.css in RT API && connectTime > 3 ) { loaded fine from network } else { meh }

Too slow: Fail
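The recipe above might look roughly like this when written out. It is a sketch under assumptions: `ASSET_URL` is a hypothetical absolute URL for main.css, the check runs at onload or after 5 seconds (whichever comes first), and `console.log` stands in for a real beacon.

```javascript
const ASSET_URL = "https://cdn1.mydomain.com/main.css"; // assumed absolute URL of the asset
const CONNECT_THRESHOLD_MS = 3; // threshold taken from the slide

// entry: a PerformanceResourceTiming-like object, or undefined when the asset
// has not shown up in the RT API by the time we check.
function classify(entry) {
  if (!entry) return "fail"; // still missing when we check: too slow
  const connectTime = entry.connectEnd - entry.connectStart;
  return connectTime > CONNECT_THRESHOLD_MS ? "network" : "meh";
}

function checkRTAPI() {
  return classify(performance.getEntriesByName(ASSET_URL)[0]);
}

// Browser only, Chrome only (as on the slide). Intended to run inline, high in HEAD.
if (typeof window !== "undefined" && window.chrome) {
  let reported = false;
  const report = () => {
    if (reported) return; // report once, from whichever trigger fires first
    reported = true;
    console.log("main.css:", checkRTAPI()); // replace with a real beacon
  };
  setTimeout(report, 5000);
  window.addEventListener("load", report);
}
```

A "meh" result means the entry exists but the connect time gives no evidence of a fresh network fetch, so it should probably be excluded from CDN stats.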

transferSize attribute, FTW!

transferSize = byte size that came over the wire

if ( transferSize != 0 ) { // from network }

Status: no browser is implementing this yet

Background/discussion: https://github.com/w3c/navigation-timing/issues/3
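A forward-looking sketch of that check, assuming a browser that exposes `transferSize` on Resource Timing entries. The helper name `cameFromNetwork` is made up for illustration, and it returns `null` when the attribute is missing so that unsupported browsers are not miscounted. Note that cross-origin assets served without a TAO header are expected to report 0 here as well.

```javascript
// Returns true when bytes came over the wire, false when none did,
// and null when the browser does not expose transferSize at all.
function cameFromNetwork(entry) {
  if (typeof entry.transferSize !== "number") return null; // attribute not supported
  return entry.transferSize !== 0;
}

// Browser-only usage: list resources that were fetched from the network.
if (typeof window !== "undefined") {
  window.addEventListener("load", () => {
    const fromNetwork = performance
      .getEntriesByType("resource")
      .filter((e) => cameFromNetwork(e) === true)
      .map((e) => e.name);
    console.log(fromNetwork);
  });
}
```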

Future

Take-Aways

Measure Fail Ratio too, not just Speed

Use RUM for real-world performance insight

And Synthetic monitoring for deep visibility

Beware of the many bugs in the Resource Timing API

Mitigating CDN Failure: Multi-CDN

https://www.flickr.com/photos/metali/294107810/

Selecting your CDNs

Feature & Behaviour Parity

Performance

Costs

Doing Multi-CDN

Perf data: High volume, high quality

Decision making: Make good sense of the data, quickly

Targeting: Very granular, very accurate

Time-to-Switch: Asap!

Where to switch CDNs?

In DNS:

cs109.wac.edgecastcdn.net.

cds.z4b9c4e6.hwcdn.net

OR

In HTML:

//cdn1.mydomain.com/main.css

//cdn2.mydomain.com/main.css

Traffic management in DNS

Diagram: client → resolver → authoritative DNS

Authoritative: “I see the request comes from NL, based on resolver IP address, so I’ll hand out the CNAME to the CDN configured for NL”

CDN A

CDN B

Static Geo

Always route to that CDN in that geo

Easy: no need to monitor perf

But what if CDN has boo boo?
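Static geo routing amounts to a fixed lookup table in the authoritative DNS. A minimal sketch, where the country-to-CDN assignments and the default are invented for illustration and the CNAME targets are the example hostnames from the deck:

```javascript
// Hypothetical static mapping from country code to CDN CNAME target.
const STATIC_GEO = new Map([
  ["NL", "cs109.wac.edgecastcdn.net."],
  ["US", "cds.z4b9c4e6.hwcdn.net."],
]);
const DEFAULT_CNAME = "cs109.wac.edgecastcdn.net."; // assumed fallback

// Return the CNAME to hand out for a request geolocated to `country`.
function cnameForCountry(country) {
  return STATIC_GEO.get(country) ?? DEFAULT_CNAME;
}
```

Nothing in this table reacts to performance data, which is exactly the weakness the last bullet points at.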

Dynamic Geo

Always route to best CDN per geo

Needs solid perf data (RUM!)

Geo targeting accuracy important
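In the dynamic variant the lookup table is rebuilt from fresh performance data. A deliberately naive sketch, assuming RUM data has already been aggregated into a median response time per country and CDN (the data shape and function name are assumptions):

```javascript
// rum: { [country]: { [cdn]: medianMs } }, hypothetical aggregated RUM data.
// Returns { [country]: cdn } with the lowest median per country.
function bestCdnPerGeo(rum) {
  const best = {};
  for (const [country, byCdn] of Object.entries(rum)) {
    const ranked = Object.entries(byCdn).sort((a, b) => a[1] - b[1]);
    best[country] = ranked[0][0];
  }
  return best;
}
```

Re-running this on a schedule and pushing the result to DNS gives the switch; a real decider would look at more than the median.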

Dynamic Geo + ASN

Holy grail: gives best results

Really needs RUM data, and lots of it

Targeting accuracy even more important

Geo targeting gone wrong

Diagram: client (in India) → 8.8.8.8 (resolver) → authoritative DNS

Authoritative: “I see the request comes from MY, based on resolver IP address, so I’ll hand out the CNAME to the CDN configured for MY”

CDN A

CDN B - best in India!

EDNS0 to the rescue !

Diagram: client (in India) → 8.8.8.8 (resolver) → authoritative DNS

Authoritative: “I see the request comes from IN, based on the client IP address /24, so I’ll hand out the CNAME to the CDN configured for IN”

CDN A

CDN B - best in India!
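The difference between the two diagrams comes down to which address the authoritative server geolocates. A sketch of that choice, assuming the DNS server exposes the EDNS0 Client Subnet as `query.clientSubnet`; the field names and the tiny geo table (including the MY entry for the resolver, taken from the slide's scenario) are hypothetical:

```javascript
// Hypothetical geo database keyed by /24 prefix.
const GEO_DB = new Map([
  ["8.8.8.0/24", "MY"],      // resolver address, as geolocated in the slide's example
  ["203.0.113.0/24", "IN"],  // documentation-range prefix standing in for the client
]);

// Reduce an IPv4 address to its /24 prefix.
function prefix24(ip) {
  return ip.split(".").slice(0, 3).join(".") + ".0/24";
}

// Prefer the client subnet when the resolver sent one; else fall back to the resolver IP.
function countryFor(query) {
  const source = query.clientSubnet || prefix24(query.resolverIp);
  return GEO_DB.get(source) || "unknown";
}
```

Without the client subnet the lookup can only see the resolver, which is how a client in India ends up treated as MY.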

Decision Making

Look at everything, not just ‘Response Time’

Use multiple statistics, not just median

Make your ‘decider’ sensitive to Fail Ratio !

Tuning your logic takes time
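One way these points could combine into a 'decider', as a sketch only: the 2% fail-ratio limit, the median-plus-p95 score and the sample format are assumptions, not the speakers' logic. It expects at least one sample per CDN.

```javascript
const MAX_FAIL_RATIO = 0.02; // assumed tolerance; tune to your traffic

// Nearest-rank percentile on an ascending array.
function percentile(sorted, p) {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// samples: array of { ms } for successes or { fail: true } for failures.
function summarize(samples) {
  const ok = samples.filter((s) => !s.fail).map((s) => s.ms).sort((a, b) => a - b);
  return {
    failRatio: (samples.length - ok.length) / samples.length,
    median: percentile(ok, 50),
    p95: percentile(ok, 95),
  };
}

// Prefer CDNs under the fail-ratio limit; among those, lowest median + p95 wins.
// If none qualify, fall back to the lowest fail ratio.
function pickCdn(samplesByCdn) {
  const stats = Object.entries(samplesByCdn).map(([cdn, s]) => ({ cdn, ...summarize(s) }));
  const healthy = stats.filter((s) => s.failRatio <= MAX_FAIL_RATIO);
  if (healthy.length > 0) {
    healthy.sort((a, b) => a.median + a.p95 - (b.median + b.p95));
    return healthy[0].cdn;
  }
  stats.sort((a, b) => a.failRatio - b.failRatio);
  return stats[0].cdn;
}
```

With this shape, a fast CDN that fails 10% of requests loses to a slower one that never fails.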

Coping with low volume data

Don’t make changes

Make decisions with lower confidence

Have a dynamic targeting granularity
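Dynamic targeting granularity can be read as: decide at country + ASN level when there is enough data, otherwise widen the scope. A sketch, where the minimum of 200 samples, the key format and the ASN (taken from the range reserved for documentation) are assumptions:

```javascript
const MIN_SAMPLES = 200; // assumed minimum before a decision is trusted

// counts: { [key]: numberOfSamples } for keys like "DE/AS64500", "DE", "global".
// Returns the most granular key with enough data, or null (don't make changes).
function decisionScope(counts, country, asn) {
  const scopes = [`${country}/${asn}`, country, "global"];
  return scopes.find((key) => (counts[key] || 0) >= MIN_SAMPLES) || null;
}
```

Returning `null` maps to the first bullet: with too little data at every level, leave the current routing alone.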

Experiment: do a Pat Meenan !

http://www.slideshare.net/patrickmeenan/service-workers-for-performance

Hi, I’m Pat

Example CDN Perf Program

http://bit.ly/1KppCfo

Limitations

Not Practical for Monitoring

Humans are Required

Misses Important Factors (e.g. SSL)

Hard to Commit to Bandwidth

War Story

Seattle Down

or How we Started Caring About MultiCDN

Yay, Monitoring!

Support Lines - A Tale of Lowered Expectations

IT Crowd - Fremantle Media

Batman - 20th Century Fox

SLA Reminder (allowed downtime per year, rounded)

99.9%: 8 hours

99.99%: 52 minutes

99.999%: 5 minutes
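The arithmetic behind these figures, as a small sketch. It assumes a 365-day year, under which 99.9% works out to about 8.76 hours, 99.99% to about 52.6 minutes and 99.999% to about 5.3 minutes; the function name is made up for illustration.

```javascript
// Allowed downtime in minutes for a given availability over a period.
function allowedDowntimeMinutes(availabilityPercent, periodDays = 365) {
  const minutesInPeriod = periodDays * 24 * 60;
  return (1 - availabilityPercent / 100) * minutesInPeriod;
}

for (const sla of [99.9, 99.99, 99.999]) {
  const minutes = allowedDowntimeMinutes(sla);
  console.log(`${sla}%: ${minutes.toFixed(1)} minutes per year`);
}
```

Seen this way, the 3 to 4 hours of intermittent failure in the war story would use up a large share of a 99.9% yearly budget in a single incident.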

Best Effort

Switch over to an alternate CDN for the entire service, across the globe, as per the “Holy S#!t Handbook”.

This Isn’t Pretty

Cold Cache

Backend Thrashing

3 to 4 hours of Intermittent Failure

Lessons Learned

Hot Standby

Geographic DNS Control, and Optimization

More Monitoring

Thank you.

@aaronpeters - [email protected]

@ksgyoung - [email protected]

Questions?