The SmugMug Tale

The SmugMug Tale - O'Reilly Mediaassets.en.oreilly.com/1/event/21/The SmugMug Tale Presentation.pdf · The SmugMug Tale. Premium photo ... (Arista) cheap & fast Less than $500/port

Premium photo & video sharing.

Bootstrapped in ’02.

$10M+ as of ’07.

Profitable.

No debt.

Top 400 website.

Doubling yearly.

Who are we?

Premium means “more” and “better”.

Unlimited storage.

Unlimited bandwidth.

Big photos (48Mpix). 500M+ of them.

Big video (1920x1080p).

Lots of photos per page.

Super fast.

Our challenge

LAMP(hp).

x86 (mostly AMD) on Linux (~300 4+ core hosts?)

4 datacenters: 2 x SV, 1 x VA, 1 x SEA

2 Ops guys. :)

Majority of boxes are diskless.

Consume lots of cloud services (S3, EC2, etc).

Architecture overview

Binary data (photos, video, etc):

Stored in Amazon’s S3. PBs.

Akamai fronts for caching and acceleration.

Structured data (Database, etc):

MySQL (InnoDB mostly).

4+ cores, 64GB, >2TB storage

Memcached fronts for caching.

Storage

Photo & video processing / encoding:

Handled in Amazon EC2.

Totally autonomous scaling (SkyNet)

Customer facing:

Diskless web boxes (PXE boot)

Scaled up *and* out MySQL

Memcached ~1TB

Compute

Super-fast CDN:

Reads often already close to customer.

More than just a CDN:

HTML/AJAX/etc inspection for pre-fetch

Anticipate requests and get data to within low ms

Optimal data path to SmugMug

DNS latency reduction

$$$ but worth it. Get what you pay for.

Secret Weapon: Akamai

Screaming fast.

~1TB of data stored.

>96% hit rate

Contains MySQL row data, avoid SELECTs

Misc other data cached, but MySQL biggest win

Fall back on MySQL for cold data

Secret Weapon: memcached
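
The read path on this slide (check memcached first, fall back on MySQL for cold data) can be sketched roughly as below. This is a minimal illustration and not SmugMug's code: a plain dict stands in for a memcached client, `load_row_from_db` is a hypothetical stand-in for a MySQL lookup by primary key, and the `table:pkey` key format is invented for the example.

```python
class RowCache:
    """Cache-aside reader: try the cache, fall back to the database on a miss."""

    def __init__(self, load_row_from_db):
        self._cache = {}               # stand-in for a memcached client (assumption)
        self._load = load_row_from_db  # hypothetical DB lookup by primary key
        self.hits = 0
        self.misses = 0

    def get(self, table, pkey):
        key = f"{table}:{pkey}"        # illustrative key format, not from the slides
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Cold data: go to the database, then warm the cache for the next reader.
        self.misses += 1
        row = self._load(table, pkey)
        if row is not None:
            self._cache[key] = row
        return row

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_db(table, pkey):
    # Hypothetical database: every primary key resolves to a small row dict.
    return {"id": pkey, "table": table}


cache = RowCache(fake_db)
for _ in range(25):
    cache.get("photos", 1)
```

In this toy run the first read misses and the next 24 are served from the cache, which is a 96% hit rate: only 1 read in 25 reaches the database.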

Most important technology at SmugMug.

Super dependent on replication:

Performance

Reliability / High Availability

No MySQL data loss in >7 years.

No JOINs. (Or lots of 4.x+ features, either)

Vertically partitioned, not horizontally (no shards)

Secret Weapon: MySQL
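
"Vertically partitioned, not horizontally" means whole tables are assigned to different database clusters, rather than the rows of one table being split across shards. A minimal sketch of that routing, with invented table and cluster names, might look like this:

```python
# Hypothetical table-to-cluster map; every name here is invented for illustration.
TABLE_TO_CLUSTER = {
    "users": "db-accounts",
    "albums": "db-galleries",
    "photos": "db-photos",
    "comments": "db-community",
}


def cluster_for(table):
    """Return the cluster that owns every row of `table` (no per-row sharding)."""
    try:
        return TABLE_TO_CLUSTER[table]
    except KeyError:
        raise ValueError(f"no cluster configured for table {table!r}") from None
```

Routing depends only on the table name, never on a row's key. Under this layout two tables may live on different clusters, which fits naturally with a no-JOINs rule.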

Most important technology at SmugMug.

Huge thanks to Heikki, Oracle, Percona and Google!

Running 1.0.3+patches in production.

Big performance gains with recent releases.

Secret Weapon: InnoDB

Crazy concentration of talent under one roof.

Best MySQL dollars we’ve ever spent.

Helped us out of a major bind

Have you heard of the ‘back_log’ mysqld setting?

Me neither. Hope you never do. Percona had.

Helped build, integrate, and test InnoDB patches.

Secret Weapon: Percona

We care about write latency above all.

Well, ok, maybe data integrity. ;)

Scaling reads “easy”: replication and memcached.

Replication needs to stay current (<1 sec).

MySQL concurrency problems. (Much improved!)

Parallel I/O - lots of cores.

Large storage (TBs).

Big RAM (64GB+) to keep indexes hot.

MySQL details

Mostly SELECT pkey FROM table WHERE index;

On cache miss, SELECT * FROM table WHERE pkey;

UPDATEs/DELETEs mostly on single rows by pkey

Easy memcached expiration.

Easy slave-delay tracking.

Very denormalized.

No JOINs or complex SELECTs.

OLTP benchmark imperfect. Time for sysbench-web?

MySQL query details
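
The query pattern on this slide can be sketched as follows: the index query returns primary keys only, full rows are fetched by primary key on a cache miss, and a single-row write expires exactly one cache key. This is an illustration under stated assumptions: `sqlite3` stands in for MySQL, a dict stands in for memcached, and the `photos` schema is invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")   # stand-in for MySQL (assumption)
db.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, album_id INTEGER, title TEXT)")
db.execute("CREATE INDEX photos_album ON photos (album_id)")
db.executemany("INSERT INTO photos VALUES (?, ?, ?)",
               [(1, 10, "sunset"), (2, 10, "beach"), (3, 11, "city")])
db.commit()

row_cache = {}                     # stand-in for memcached (assumption)
stats = {"db_row_reads": 0}        # full-row SELECTs that actually hit the DB


def photos_in_album(album_id):
    # Step 1: SELECT pkey FROM table WHERE index (keys only, no row data).
    pkeys = [r[0] for r in db.execute(
        "SELECT id FROM photos WHERE album_id = ? ORDER BY id", (album_id,))]
    rows = []
    for pkey in pkeys:
        row = row_cache.get(pkey)
        if row is None:
            # Step 2: on a cache miss, SELECT * FROM table WHERE pkey.
            row = db.execute("SELECT * FROM photos WHERE id = ?", (pkey,)).fetchone()
            stats["db_row_reads"] += 1
            row_cache[pkey] = row
        rows.append(row)
    return rows


def rename_photo(pkey, title):
    # Single-row UPDATE by primary key, then expire that one cache entry.
    db.execute("UPDATE photos SET title = ? WHERE id = ?", (title, pkey))
    db.commit()
    row_cache.pop(pkey, None)
```

Because each cached value is one row keyed by its primary key, a write knows exactly which cache entry to expire, which is what makes expiration "easy".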

Better filesystem:

CentOS Linux shop (lots of expertise).

MySQL is storage intensive (iops, size, etc).

ext3 old and busted. fsck, well, sucks.

ext4 already old and busted. :(

Want good volume management.

Serialized writes (non-parallel). Ugh.

MySQL Issues: Filesystems

Transactional.

Copy-on-write.

End-to-end data integrity.

On-the-fly corruption detection & repair.

Integrated volume management.

Snapshots & clones.

Open source software.

Filesystem Solution - ZFS!

We run Linux.

ZFS doesn’t run on Linux.

Crap.

The REAL Issue

Unknown state on crash:

Did *.info get written at commit?

Or is it *2 months* out of date?

Bringing TB+ slaves online quickly.

Backups using LVM/ZFS a pain.

Keeping up with master.

Single thread for replication SQL.

Master promotion kludgy.

MySQL Issues: Replication

Transactional replication patches:

Slave always in known state.

Either ok to bring back up or CHANGE MASTER.

Safe to take snapshots anytime, no effort.

Safe to use innodb_flush_log_at_trx_commit=2

InnoDB only. Stopgap. Global trx IDs better.

Using in pre-production. Production next week?

Replication solutions
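
The idea behind "slave always in known state" can be illustrated by recording the replication position in the same transaction as the replicated change, so a crash leaves both applied or neither. The sketch below is not the actual patch set: `sqlite3` stands in for an InnoDB slave, and the `slave_position` table and `apply_event` helper are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")   # stand-in for an InnoDB slave (assumption)
db.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE slave_position (id INTEGER PRIMARY KEY CHECK (id = 1), pos INTEGER)")
db.execute("INSERT INTO slave_position VALUES (1, 0)")
db.commit()


def apply_event(pos, sql, params, crash_before_commit=False):
    """Apply one replicated statement and record its position atomically."""
    try:
        with db:   # commits on success, rolls back if an exception escapes
            db.execute(sql, params)
            db.execute("UPDATE slave_position SET pos = ? WHERE id = 1", (pos,))
            if crash_before_commit:
                raise RuntimeError("simulated crash")
    except RuntimeError:
        pass       # after the simulated crash, neither change is visible


def current_position():
    return db.execute("SELECT pos FROM slave_position").fetchone()[0]


apply_event(100, "INSERT INTO photos VALUES (?, ?)", (1, "sunset"))
apply_event(200, "INSERT INTO photos VALUES (?, ?)", (2, "beach"),
            crash_before_commit=True)
```

After the simulated crash the stored position is still 100 and only the first row exists, so replication can resume from a position that matches the data. With the position kept in a separate file, as with the `*.info` files, the two can drift apart.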

Toro aka S7410.

NAS storage with a few twists.

2 x Quad-Core Opteron + 64GB RAM

100GB Readzilla SSD

2 x 18GB Writezilla SSD. 20K write iops.

22 x 1TB 7200rpm HDD

Clustered HA configuration.

Secret Weapon: Sushi

ZFS on Linux!

SSD is here!

SSD performance is cheap!

Consume via NFS, iSCSI, CIFS, HTTP, FTP, etc.

Massive flexibility - no more DAS.

Fishworks interface is a dream.

Analytics is a game changer.

Mmm, Toro tastes good.

Initial sticker shock - $80K?! $142K clustered?!

No one pays list price. Whew.

Startup Essentials. Double-whew.

Paradigm shift. Biggest whew!

DAS -> NAS

So much IO, in theory, can “stack” lots of clients.

In practice, can stack *lots* of clients.

We now have 5 clustered configs. :)

Sushi’s quite reasonable

Crazy fast. 9.6K iops, 4.5K under 43us, 8K under 166us

Sushi served fast

Scalable. 15K 4k write iops w/16 threads.

Low latency. ~250us @ 3K iops, ~700us @ 10K

Sushi served fast

[Figure: fio write benchmark. 4K write iops (axis 0 to 20000) vs. threads (1, 2, 4, 8, 16, 32).]

So fast, we’re stacking like crazy.

5 different MySQL workloads on single clustered Toro.

8 slaves on single Toro.

Each used to have 15K disks + write cache.

Lots of excess io and space capacity still.

Compression “for free” (no client CPU usage)

Crazy fast

~1.5X ratio across TBs of InnoDB

Sushi today

Backups a breeze.

Automatic snapshots every n minutes / hours / days.

No need to LOCK / shutdown / STOP SLAVE / etc

Rollback anytime. Skip bad SQL statements.

New slave? Click snapshot. Click clone. Done.

Slaves share unchanged data on disk and in RAM.

Future bright: clone + de-dupe = insanely efficient.

Sushi today
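
The slide describes creating a new slave with two clicks in the appliance UI. On a plain ZFS host the same snapshot-then-clone idea maps to the `zfs snapshot` and `zfs clone` commands. The sketch below only builds the command lines and does not run them; the dataset and snapshot names are hypothetical.

```python
def new_slave_commands(source_dataset, snapshot_name, slave_dataset):
    """Build (but do not run) the ZFS commands for snapshot-then-clone."""
    snapshot = f"{source_dataset}@{snapshot_name}"
    return [
        # Point-in-time, read-only copy of the source dataset.
        ["zfs", "snapshot", snapshot],
        # Writable clone that shares all unchanged blocks with the snapshot.
        ["zfs", "clone", snapshot, slave_dataset],
    ]


# Hypothetical dataset names, for illustration only.
commands = new_slave_commands("pool/mysql/slave1", "seed", "pool/mysql/slave9")
```

Since a clone starts out sharing every block with its snapshot, it is cheap to create and takes up new space only as the new slave diverges.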

DTrace on Linux!

Never had analytics on storage before.

Vendor used to say: “Um, we dunno. Buy more spindles?”

Now I know all.

Vendor now says: “What does Analytics say?”

Drill down on everything. Correlate anything.

God-like power.

Analyzing sushi

NFSv3 (rather than v4)

16KB record size in ZFS (InnoDB)

Mirrored (RAID1+0) disks w/striped Logzilla

MySQL concurrency bound - can’t use all the I/O

If compressing, use LZJB.

In theory, can optimize InnoDB:

doublewrite = 0, checksums = 0. ZFS does these.

In practice, no big gain with our workload.

MySQL on Toro so far
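
The "in theory" InnoDB tuning above corresponds to the `innodb_doublewrite` and `innodb_checksums` server options. As a rough sketch, the snippet below renders a `my.cnf` fragment with both turned off. It is only appropriate when the storage layer provides equivalent protection, and the slide reports no big gain for this workload, so treat it as an illustration rather than a recommendation.

```python
import configparser
import io


def innodb_on_zfs_fragment():
    """Render a my.cnf fragment that disables InnoDB doublewrite and checksums."""
    cfg = configparser.ConfigParser()
    cfg["mysqld"] = {
        "innodb_doublewrite": "0",  # ZFS copy-on-write already avoids torn pages
        "innodb_checksums": "0",    # ZFS checksums blocks end to end
    }
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


fragment = innodb_on_zfs_fragment()
```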

Replication *.info files not sync’d over NFS

Found a slave with *2 month old* info files

Transactional replication to the rescue!

NFS locking and InnoDB

Warnings on the Net. No hard data.

Actively researching. What’s the problem?

MySQL on Toro problems

10GbE for reduced latency?

Actively testing this.

Driver tuning required. Defaults for throughput.

Cards (Intel) & switches (Arista) cheap & fast

Less than $500/port.

Copper twinax SFP+ cables cheap. Optical XFP not.

$50 vs $1000+

Toro doesn’t support SFP+ cards yet. :(

Even faster?

Everything runs better on Toro. :)

Revision control.

Stateless Linux mounts.

Email.

Developer home directories.

Built-in, automatic replication for multi-site backups.

Photo and video serving?

Kitchen sink on Toro

100% SSD.

Still too $$ for TB+ installs.

Even better InnoDB.

Community on fire. Oracle/MySQL accepting patches!

Multi-threaded replication.

Preview release is out. Yes!

New storage engines

PBXT, Falcon, Maria, oh my!

The future?

MySQL is a crown jewel.

Not a gateway drug to Oracle. Different customers.

Kill btrfs. GPL ZFS.

MySQL and InnoDB under one roof = opportunity.

OpenStorage is game changer. Don’t kill it.

Listen to your new communities.

I’m busy. I’m up here because this is important.

Oracle wishlist