SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Andrew Shieh, SmugMug Operations
shandrew@smugmug.com
November 15, 2013

SmugMug—Who are we?

The early days of SmugMug
• Gradual bootstrapped growth
• Multiple self-managed datacenter cages
• Too many servers of varying types
• Too many disks
• Tons of valuable skilled employee hours spent in cages

Data Center Fantasy

Data Center Reality

SmugMug <3 AWS
• Early adopter of Amazon S3
• Over the years, moved rendering, upload, archiving, payments, permissions, email, and more compute to AWS
• Before mid-2012, no ultra-high performance I/O

SmugMug Architecture ~2006

AWS: S3
SV: Web, DB, Image*

SmugMug Architecture ~2011

AWS: S3, Image (upload, processing, render, video, …)
SV: Web, DB

SmugMug Architecture - Transition

AWS: S3, Image*, Web
SV: Web, DB
DC: Replication DB, Direct Connect

SmugMug Architecture Today

AWS: S3, Image*, Web, DB
Ø

How did we get there?

Our database I/O evolution: Always cutting edge
• Started with MySQL on spinning disk RAID, max RAM
• Moved to ZFS SSD + SSD cache + spinning disks
• Moved to custom 24-SSD arrays

hi1.4xlarge FTW
• Our custom, obscure hardware => difficult to resolve problems, difficult to upgrade
• hi1 overall DB I/O performance comparable to 8 x SSD RAID10
• < 3%/yr hi1 instance failure rate!
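To put the "< 3%/yr" figure in context, here is a rough back-of-the-envelope sketch. It assumes instance failures are independent and treats 3% as an upper bound; the fleet sizes are hypothetical and do not come from the talk.

```python
def prob_any_failure(annual_rate: float, instances: int) -> float:
    """Chance that at least one of `instances` fails within a year,
    assuming independent failures at the same annual rate."""
    return 1.0 - (1.0 - annual_rate) ** instances


ANNUAL_RATE = 0.03  # upper bound quoted on the slide

# Fleet sizes below are hypothetical, chosen only for illustration.
for fleet in (1, 10, 30):
    print(f"{fleet:>2} instances: {prob_any_failure(ANNUAL_RATE, fleet):.0%}")
```

Even a low per-instance rate adds up across a fleet, which is one reason data on ephemeral storage is usually paired with replication.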

Amazon VPC - also a big win
• Easy mapping of internal / external network security model to AWS

Zero downtime move?

Zero Downtime Move
• Flexibility of the AWS cloud makes a zero downtime move inexpensive. Pay for only what you use. Provision fast.
• Plan
• Test
• Plan and test again

Major changes post-move
• Database storage goes from SSD to hi1.4xlarge ephemeral
• Hardware load balancers become ELB (Elastic Load Balancing)
• haproxy layer 7 load/traffic directing goes from static to dynamic config
• Web servers autoscale for each cluster
• Membase to ElastiCache (later to Amazon EC2)
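A minimal sketch of what moving haproxy "from static to dynamic config" might look like: regenerating a backend section from whichever web servers are currently in service. The backend name, server names, addresses, and the `render_backend` helper are all hypothetical; the talk does not describe SmugMug's actual tooling.

```python
from typing import Iterable, Tuple


def render_backend(name: str, servers: Iterable[Tuple[str, str, int]]) -> str:
    """Render an haproxy `backend` section from (server_name, ip, port) tuples."""
    lines = [f"backend {name}", "    balance roundrobin"]
    for server_name, ip, port in sorted(servers):
        # `check` enables haproxy health checks for this server.
        lines.append(f"    server {server_name} {ip}:{port} check")
    return "\n".join(lines) + "\n"


# Hypothetical instances, e.g. as discovered from an autoscaling group.
current = [("web-b", "10.0.1.12", 80), ("web-a", "10.0.1.11", 80)]
config = render_backend("web_cluster", current)
print(config)
```

In a setup like this, the rendered file would be written out and haproxy reloaded whenever the server list changes.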

Zero Downtime Move Requirements
• Read-only site mode
• Traffic control - shadow load
• Cross country MySQL replication + sufficient bandwidth
• Bot testing
• Read-only live site testing w/ QA
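The read-only site mode listed above can be illustrated with a small sketch: reads keep working while every write path checks a single flag. The `Site` class, its methods, and `ReadOnlyError` are hypothetical names for illustration, not SmugMug's code.

```python
class ReadOnlyError(Exception):
    """Raised when a write is attempted while the site is read-only."""


class Site:
    def __init__(self) -> None:
        self.read_only = False
        self.photos = {}

    def get_photo(self, photo_id: str):
        # Reads are always allowed, even in read-only mode.
        return self.photos.get(photo_id)

    def upload_photo(self, photo_id: str, data: bytes) -> None:
        # Every write path checks the flag before touching storage.
        if self.read_only:
            raise ReadOnlyError("site is in read-only mode")
        self.photos[photo_id] = data


site = Site()
site.upload_photo("p1", b"...")
site.read_only = True  # flip the switch before the migration
try:
    site.upload_photo("p2", b"...")
except ReadOnlyError as exc:
    print("rejected:", exc)
print("p1 still readable:", site.get_photo("p1") is not None)
```

Keeping the check in one place makes it easier to exercise the mode with bots and QA before the real cutover.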

More on moving
• Full scale read-write testing is difficult
• Be aware of AWS limits
• Talk to support for big growth
• Roll back plan - manage risky change

Flipping the switch to AWS
• “The biggest, scariest engineering change we've made in the company's history” - Don, SmugMug Chief Geek
• Go read-only (1 min)
• Pre-scale up big
• MHA to reassign MySQL masters and their replication (30 min)
• Point DNS+CDN to Elastic Load Balancing (5-30 min)

Flipping the switch to AWS
• Test! (60 min)
• When read-only is all good, go to read-write (5 min)
• Test! Inevitable bugs at this step (hours)
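Adding up the step estimates from the two "Flipping the switch" slides gives a rough size for the read-only window. This sketch assumes the steps run strictly in sequence; "Pre-scale up big" is omitted because the slide gives no duration for it, and the final open-ended testing ("hours") happens after the site is read-write again.

```python
# (step, min_minutes, max_minutes), taken from the slide estimates.
STEPS = [
    ("Go read-only", 1, 1),
    ("MHA reassigns MySQL masters and replication", 30, 30),
    ("Point DNS+CDN to Elastic Load Balancing", 5, 30),
    ("Test in read-only mode", 60, 60),
    ("Go read-write", 5, 5),
]

low = sum(lo for _, lo, _ in STEPS)
high = sum(hi for _, _, hi in STEPS)
print(f"Read-only window: roughly {low}-{high} minutes")
# prints "Read-only window: roughly 101-126 minutes"
```

Under these assumptions the site spends about two hours in read-only mode, with DNS and CDN propagation as the main source of variation.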

MHA?
• Facebook, DeNA
• Helps to reliably reassign MySQL masters and replication, maintaining consistency

MHA?
• Manual failover in MySQL 5.5 and earlier is painful, time-consuming
• Be careful with automation for rare events - it can bite
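One part of reassigning a master is working out which replica has applied the most of the old master's binary log. The sketch below illustrates only that comparison and is not MHA's implementation. The host names and positions are made up, and the two values per host mirror the `Relay_Master_Log_File` and `Exec_Master_Log_Pos` fields of `SHOW SLAVE STATUS`.

```python
from typing import Dict, Tuple

# host -> (Relay_Master_Log_File, Exec_Master_Log_Pos); values are hypothetical.
Replicas = Dict[str, Tuple[str, int]]


def binlog_key(coords: Tuple[str, int]) -> Tuple[int, int]:
    """Order binlog coordinates by file sequence number, then byte offset."""
    log_file, pos = coords
    return int(log_file.rsplit(".", 1)[1]), pos


def most_advanced(replicas: Replicas) -> str:
    """Return the replica that has executed furthest into the master's binlog."""
    return max(replicas, key=lambda host: binlog_key(replicas[host]))


replicas = {
    "db-replica-1": ("mysql-bin.000041", 98_304),
    "db-replica-2": ("mysql-bin.000042", 1_024),
    "db-replica-3": ("mysql-bin.000042", 512),
}
print(most_advanced(replicas))  # prints "db-replica-2"
```

Doing this by hand across several replication trees is the kind of painful, error-prone work the slides refer to.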

Problems?
• Completely redundant network links can fail
• Bugs related to IP address change
• ElastiCache performance
• NewRelic! Use it or a similar APM product

Results

Results
• Data Center - performance fluctuated through day
• AWS w/scaling - flat performance throughout the day - significant scalability limits removed
• Networking was a key improvement
• Success!

Lessons Learned
• We love AWS even more than before
• Automate everything
• Understand Amazon EBS, and understand underlying details of AWS services
• Unpredictable Ops schedules vs. large projects

Lessons Learned

Job #1: Making business happen

We made more changes, because we could
• As long as we’re moving our infrastructure, why not rebuild most of it too?
• Linux, MySQL, package versions upgraded
• New monitoring tools
• NFS dependencies eliminated, moved to Amazon S3 or DynamoDB
• Code pushes managed by nice distributed tools utilizing Amazon S3 + internal torrent

One last thing...
• Go Multi-availability-zone!
• Load balancers send traffic to multiple haproxy per AZ with AZ-specific web clusters, DB replicas
• Backed up w/ cross AZ
• Keep SPOFs in one AZ
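The multi-AZ layout above can be sketched as AZ-aware backend selection: prefer servers in the caller's own AZ and use another AZ only as a backup. The fallback policy, the AZ names, the server names, and the `pick_backend` helper are assumptions made for illustration.

```python
import random
from typing import Dict, List


def pick_backend(request_az: str, backends: Dict[str, List[str]]) -> str:
    """Prefer a healthy backend in the caller's AZ; otherwise use another AZ."""
    local = backends.get(request_az, [])
    if local:
        return random.choice(local)
    # Cross-AZ backup path, used only when the local pool is empty.
    others = [b for az, pool in backends.items() if az != request_az for b in pool]
    if not others:
        raise RuntimeError("no healthy backends in any AZ")
    return random.choice(others)


# Hypothetical healthy web servers per AZ.
healthy = {"us-east-1a": ["web-a1", "web-a2"], "us-east-1b": ["web-b1"]}
print(pick_backend("us-east-1a", healthy))

healthy["us-east-1a"] = []  # simulate losing every web server in one AZ
print(pick_backend("us-east-1a", healthy))
```

Keeping traffic inside one AZ in the normal case avoids cross-AZ latency, while the backup path keeps the site up if an AZ's web cluster disappears.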

Questions?
Andrew Shieh, Sunnyvale, CA
shandrew@smugmug.com
@shandrew

http://www.smugmug.com/ http://pics.shieh.info/

Thank you!
