The SmugMug Tale Presentation

  • 8/9/2019 The SmugMug Tale Presentation

    1/33

    The SmugMug Tale

    http://cmac.smugmug.com/gallery/2504559%23131487110
  • 8/9/2019 The SmugMug Tale Presentation

    2/33

    Premium photo & video sharing.

    Bootstrapped in '02.

    $10M+ as of '07.

    Profitable.

    No debt.

    Top 400 website.

    Doubling yearly.

    Who are we?

  • 8/9/2019 The SmugMug Tale Presentation

    3/33

    Premium means more and better.

    Unlimited storage.

    Unlimited bandwidth.

    Big photos (48Mpix). 500M+ of them.

    Big video (1920x1080p).

    Lots of photos per page.

    Super fast.

    Our challenge

  • 8/9/2019 The SmugMug Tale Presentation

    4/33

    LAMP(hp).

    x86 (mostly AMD) on Linux (~300 4+ core hosts?)

    4 datacenters: 2 x SV, 1 x VA, 1 x SEA

    2 Ops guys. :)

    Majority of boxes are diskless.

    Consume lots of cloud services (S3, EC2, etc).

    Architecture overview

  • 8/9/2019 The SmugMug Tale Presentation

    5/33

    Binary data (photos, video, etc):

    Stored in Amazon's S3. PBs.

    Akamai fronts for caching and acceleration.

    Structured data (Database, etc):

    MySQL (InnoDB mostly).

    4+ cores, 64GB, >2TB storage

    Memcached fronts for caching.

    Storage

  • 8/9/2019 The SmugMug Tale Presentation

    6/33

    Photo & video processing / encoding:

    Handled in Amazon EC2.

    Totally autonomous scaling (SkyNet)

    Customer facing:

    Diskless web boxes (PXE boot)

    Scaled up *and* out MySQL

    Memcached ~1TB

    Compute

  • 8/9/2019 The SmugMug Tale Presentation

    7/33

    Super-fast CDN:

    Reads often already close to customer.

    More than just a CDN:

    HTML/AJAX/etc inspection for pre-fetch

    Anticipate requests and get data to within low ms

    Optimal data path to SmugMug

    DNS latency reduction

    $$$ but worth it. Get what you pay for.

    Secret Weapon: Akamai

  • 8/9/2019 The SmugMug Tale Presentation

    8/33

    Screaming fast.

    ~1TB of data stored.

    >96% hit rate

    Contains MySQL row data, avoid SELECTs

    Misc other data cached, but MySQL biggest win

    Fall back on MySQL for cold data

    Secret Weapon: memcached
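
The read path on this slide (serve row data from memcached, fall back on MySQL for cold data) is the usual cache-aside pattern. A minimal sketch, assuming a plain dict stands in for the memcached client and a hypothetical `fetch_row` callable stands in for the MySQL lookup; neither name comes from the talk:

```python
class RowCache:
    """Cache-aside reads: try the cache first, fall back to the database."""

    def __init__(self, fetch_row):
        self._fetch_row = fetch_row   # hypothetical callable: pkey -> row or None
        self._cache = {}              # stand-in for a memcached client
        self.hits = 0
        self.misses = 0

    def get(self, pkey):
        row = self._cache.get(pkey)
        if row is not None:
            self.hits += 1
            return row
        self.misses += 1
        row = self._fetch_row(pkey)   # cold data: go to the database
        if row is not None:
            self._cache[pkey] = row   # warm the cache for the next reader
        return row

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Illustrative data only; a dict plays the role of the MySQL table.
database = {1: {"id": 1, "title": "Sunset"}, 2: {"id": 2, "title": "Harbor"}}
cache = RowCache(database.get)
for pkey in (1, 1, 1, 2, 1):
    cache.get(pkey)
```

Here the first read of each key misses and later reads hit. With a hit rate above 96%, as quoted on the slide, fewer than one read in 25 would reach MySQL in a setup like this.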

  • 8/9/2019 The SmugMug Tale Presentation

    9/33

    Most important technology at SmugMug.

    Super dependent on replication:

    Performance

    Reliability / High Availability

    No MySQL data loss in >7 years.

    No JOINs. (Or lots of 4.x+ features, either)

    Vertically partitioned, not horizontally (no shards)

    Secret Weapon: MySQL
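
To make the vertical-versus-horizontal distinction concrete: with vertical partitioning the table decides which database cluster a query goes to, while with sharding the row key decides. A rough sketch with made-up table and cluster names (the talk does not say how SmugMug's tables are actually split):

```python
# Hypothetical table-to-cluster assignment; all names are illustrative only.
TABLE_TO_CLUSTER = {
    "users": "db-accounts",
    "albums": "db-galleries",
    "photos": "db-galleries",
    "comments": "db-community",
}


def cluster_for(table):
    """Vertical partitioning: the table name alone picks the cluster."""
    return TABLE_TO_CLUSTER[table]


def shard_for(key, shard_count):
    """Horizontal sharding, shown only for contrast: the row key picks the shard."""
    return key % shard_count
```

Under this kind of layout every row of a table lives on one cluster, so single-table lookups by key need no routing logic beyond the table name. It also fits the "No JOINs" rule, since tables on different clusters cannot be joined in one query anyway.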

  • 8/9/2019 The SmugMug Tale Presentation

    10/33

    Most important technology at SmugMug.

    Huge thanks to Heikki, Oracle, Percona and Google!

    Running 1.0.3+patches in production.

    Big performance gains with recent releases.

    Secret Weapon: InnoDB

  • 8/9/2019 The SmugMug Tale Presentation

    11/33

    Crazy concentration of talent under one roof.

    Best MySQL dollars we've ever spent.

    Helped us out of a major bind

    Have you heard of the back_log mysqld setting?

    Me neither. Hope you never do. Percona had.

    Helped build, integrate, and test InnoDB patches.

    Secret Weapon: Percona

  • 8/9/2019 The SmugMug Tale Presentation

    12/33

    We care about write latency above all.

    Well, ok, maybe data integrity. ;)

    Scaling reads easy: replication and memcached.

    Replication needs to stay current (

  • 8/9/2019 The SmugMug Tale Presentation

    13/33

    Mostly SELECT pkey FROM table WHERE index;

    On cache miss, SELECT * FROM table WHERE pkey;

    UPDATEs/DELETEs mostly on single rows by pkey

    Easy memcached expiration.

    Easy slave-delay tracking.

    Very denormalized.

    No JOINs or complex SELECTs.

    OLTP benchmark imperfect. Time for sysbench-web?

    MySQL query details
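
The query pattern above can be sketched end to end. This is only an illustration: SQLite (via Python's `sqlite3`) stands in for MySQL, a dict stands in for memcached, and the `photos` table and its columns are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, album_id INTEGER, title TEXT)")
db.execute("CREATE INDEX photos_album ON photos (album_id)")
db.executemany("INSERT INTO photos VALUES (?, ?, ?)",
               [(1, 10, "Sunset"), (2, 10, "Harbor"), (3, 11, "Peak")])

row_cache = {}  # stand-in for memcached, keyed by primary key


def photo_ids_in_album(album_id):
    # Step 1: the indexed query returns primary keys only.
    cur = db.execute("SELECT id FROM photos WHERE album_id = ? ORDER BY id", (album_id,))
    return [r[0] for r in cur]


def photo_row(pkey):
    # Step 2: serve the row from cache; on a miss, fetch it by primary key.
    if pkey not in row_cache:
        row_cache[pkey] = db.execute(
            "SELECT * FROM photos WHERE id = ?", (pkey,)).fetchone()
    return row_cache[pkey]


def rename_photo(pkey, title):
    # Single-row write by primary key, so exactly one cache key goes stale.
    db.execute("UPDATE photos SET title = ? WHERE id = ?", (title, pkey))
    row_cache.pop(pkey, None)


rows = [photo_row(pkey) for pkey in photo_ids_in_album(10)]
rename_photo(2, "Harbor at dawn")
```

Because writes touch one row by primary key, expiring the cache is a single delete of that key, which is presumably what "Easy memcached expiration" refers to.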

  • 8/9/2019 The SmugMug Tale Presentation

    14/33

    Better filesystem:

    CentOS Linux shop (lots of expertise).

    MySQL is storage intensive (iops, size, etc).

    ext3 old and busted. fsck, well, sucks.

    ext4 already old and busted. :(

    Want good volume management.

    Serialized writes (non-parallel). Ugh.

    MySQL Issues: Filesystems

  • 8/9/2019 The SmugMug Tale Presentation

    15/33

  • 8/9/2019 The SmugMug Tale Presentation

    16/33

    We run Linux.

    ZFS doesn't run on Linux.

    Crap.

    The REAL Issue

  • 8/9/2019 The SmugMug Tale Presentation

    17/33

    Unknown state on crash:

    Did *.info get written at commit?

    Or is it *2 months* out of date?

    Bringing TB+ slaves online quickly.

    Backups using LVM/ZFS a pain.

    Keeping up with master.

    Single thread for replication SQL.

    Master promotion kludgy.

    MySQL Issues: Replication

  • 8/9/2019 The SmugMug Tale Presentation

    18/33

    Transactional replication patches:

    Slave always in known state.

    Either ok to bring back up or CHANGE MASTER.

    Safe to take snapshots anytime, no effort.

    Safe to use innodb_flush_log_at_trx_commit=2

    InnoDB only. Stopgap. Global trx IDs better.

    Using in pre-production. Production next week?

    Replication solutions

  • 8/9/2019 The SmugMug Tale Presentation

    19/33

    Toro aka S7410.

    NAS storage with a few twists.

    2 x Quad-Core Opteron + 64GB RAM

    100GB Readzilla SSD

    2 x 18GB Writezilla SSD. 20K write iops.

    22 x 1TB 7200rpm HDD

    Clustered HA configuration.

    Secret Weapon: Sushi

  • 8/9/2019 The SmugMug Tale Presentation

    20/33

    ZFS on Linux!

    SSD is here!

    SSD performance is cheap!

    Consume via NFS, iSCSI, CIFS, HTTP, FTP, etc.

    Massive flexibility - no more DAS.

    Fishworks interface is a dream.

    Analytics is a game changer.

    Mmm, Toro tastes good.

  • 8/9/2019 The SmugMug Tale Presentation

    21/33

    Initial sticker shock - $80K?! $142K clustered?!

    No one pays list price. Whew.

    Startup Essentials. Double-whew.

    Paradigm shift. Biggest whew!

    DAS -> NAS

    So much IO, in theory, can stack lots of clients.

    In practice, can stack *lots* of clients.

    We now have 5 clustered configs. :)

    Sushi's quite reasonable

  • 8/9/2019 The SmugMug Tale Presentation

    22/33

    Crazy fast. 9.6K iops, 4.5K under 43us, 8K under 166us

    Sushi served fast

  • 8/9/2019 The SmugMug Tale Presentation

    23/33

    Scalable. 15K 4k write iops w/16 threads.

    Low latency. ~250us @ 3K iops, ~700us @ 10K

    Sushi served fast

    [Figure: fio write benchmark. 4K write iops (axis 0 to 20000) vs. threads (1, 2, 4, 8, 16, 32).]
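
One way to read the latency figures on this slide is through Little's law: average requests in flight = throughput x average latency. The arithmetic below assumes the quoted latencies are mean per-operation latencies at steady state, which the slide does not state.

```python
def ops_in_flight(iops, latency_us):
    """Little's law: mean outstanding requests = arrival rate x mean latency."""
    return iops * latency_us / 1_000_000


# (iops, latency in microseconds) pairs quoted on the slide.
for iops, latency_us in [(3_000, 250), (10_000, 700)]:
    print(f"{iops} iops at {latency_us} us -> "
          f"{ops_in_flight(iops, latency_us):.2f} requests in flight")
```

Under that assumption, 3K iops at ~250us works out to about 0.75 requests in flight and 10K iops at ~700us to about 7, so the higher throughput comes from more concurrency rather than faster individual writes.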

  • 8/9/2019 The SmugMug Tale Presentation

    24/33

    So fast, we're stacking like crazy.

    5 different MySQL workloads on single clustered Toro.

    8 slaves on single Toro.

    Each used to have 15K disks + write cache.

    Lots of excess io and space capacity still.

    Compression for free (no client CPU usage)

    Crazy fast

    ~1.5X ratio across TBs of InnoDB

    Sushi today

  • 8/9/2019 The SmugMug Tale Presentation

    25/33

    Backups a breeze.

    Automatic snapshots every n minutes / hours / days.

    No need to LOCK / shutdown / STOP SLAVE / etc

    Rollback anytime. Skip bad SQL statements.

    New slave? Click snapshot. Click clone. Done.

    Slaves share unchanged data on disk and in RAM.

    Future bright: clone + de-dupe = insanely efficient.

    Sushi today
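
Why a cloned slave costs almost nothing at first can be shown with a toy copy-on-write model: a clone copies the block map, not the blocks, and the two volumes diverge only where one of them writes. This is a simplified illustration, not how ZFS is implemented; the `Volume` class and its methods are invented for the example.

```python
class Volume:
    """Toy copy-on-write volume: clones share blocks until one side writes."""

    def __init__(self, blocks=None):
        self._blocks = dict(blocks or {})   # block number -> immutable bytes

    def write(self, block_no, data):
        # Rebinds this volume's map entry only; other volumes keep the old block.
        self._blocks[block_no] = bytes(data)

    def read(self, block_no):
        return self._blocks[block_no]

    def clone(self):
        # Copies the block map (references), not the block contents.
        return Volume(self._blocks)

    def shared_blocks(self, other):
        # Counts blocks that are the very same object in both volumes.
        return sum(1 for n, b in self._blocks.items()
                   if other._blocks.get(n) is b)


master = Volume()
for n in range(4):
    master.write(n, b"page-%d" % n)

replica = master.clone()              # new slave: instant, no data copied
replica.write(2, b"replica-only")     # only this block diverges
```

After the single write, three of the four blocks are still shared between `master` and `replica`, which is the effect the slide describes as slaves sharing unchanged data.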

  • 8/9/2019 The SmugMug Tale Presentation

    26/33

    DTrace on Linux!

    Never had analytics on storage before.

    Vendor used to say: "Um, we dunno. Buy more spindles?"

    Now I know all.

    Vendor now says: "What does Analytics say?"

    Drill down on everything. Correlate anything.

    God-like power.

    Analyzing sushi

  • 8/9/2019 The SmugMug Tale Presentation

    27/33

    NFSv3 (rather than v4)

    16KB record size in ZFS (InnoDB)

    Mirrored (RAID1+0) disks w/striped Logzilla

    MySQL concurrency bound - can't use all the I/O

    If compressing, use LZJB.

    In theory, can optimize InnoDB:

    doublewrite = 0, checksums = 0. ZFS does these.

    In practice, no big gain with our workload.

    MySQL on Toro so far

  • 8/9/2019 The SmugMug Tale Presentation

    28/33

    Replication *.info files not sync'd over NFS

    Found a slave with *2 month old* info files

    Transactional replication to the rescue!

    NFS locking and InnoDB

    Warnings on the Net. No hard data.

    Actively researching. What's the problem?

    MySQL on Toro problems

  • 8/9/2019 The SmugMug Tale Presentation

    29/33

    10GbE for reduced latency?

    Actively testing this.

    Driver tuning required. Defaults for throughput.

    Cards (Intel) & switches (Arista) cheap & fast

    Less than $500/port.

    Copper twinax SFP+ cables cheap. Optical XFP not.

    $50 vs $1000+

    Toro doesn't support SFP+ cards yet. :(

    Even faster?

  • 8/9/2019 The SmugMug Tale Presentation

    30/33

    Everything runs better on Toro. :)

    Revision control.

    Stateless Linux mounts.

    Email.

    Developer home directories.

    Built-in, automatic replication for multi-site backups.

    Photo and video serving?

    Kitchen sink on Toro

  • 8/9/2019 The SmugMug Tale Presentation

    31/33

    100% SSD.

    Still too $$ for TB+ installs.

    Even better InnoDB.

    Community on fire. Oracle/MySQL accepting patches!

    Multi-threaded replication.

    Preview release is out. Yes!

    New storage engines

    PBXT, Falcon, Maria, oh my!

    The future?

  • 8/9/2019 The SmugMug Tale Presentation

    32/33

    MySQL is a crown jewel.

    Not a gateway drug to Oracle. Different customers.

    Kill btrfs. GPL ZFS.

    MySQL and InnoDB under one roof = opportunity.

    OpenStorage is game changer. Don't kill it.

    Listen to your new communities.

    I'm busy. I'm up here because this is important.

    Oracle wishlist

  • 8/9/2019 The SmugMug Tale Presentation

    33/33

    Thanks!

    Blog: http://blogs.smugmug.com/don

    Twitter: DonMacAskill

    Email: [email protected]

    Percona Conference: Upstairs :)
