The Shard Revisited: Tools and Techniques Used at Etsy

jgoulah@etsy.com / @johngoulah


Tuesday, November 12, 13

A marketplace for people around the world to connect, buy, and sell unique goods

Etsy is the marketplace that we all make together, and our mission is to re-imagine commerce in ways that build a more fulfilling and lasting world

1.5B+ page views / mo.

$895MM in sales in 2012

60MM+ unique visitors/mo.

1M+ shops / 200 countries

this talk consists of the architecture, our dev data problem/solution, and other tools

big cluster, 35 shards

100K+ queries/sec avg

6TB InnoDB buffer pool

30TB+ data stored

99.9% queries under 1ms

~1.8Gbps outbound (plain text)

1/3 RAM not dedicated to the pool (OS, disk, network buffers, etc)

~100 MySQL servers

1100 15K rpm disks / 1600+ CPUs

Server Spec

HP DL 380 G8

96GB RAM

16 spindles / 2TB RAID 10

24 Core

16 x 146GB

Architecture

2 key concerns when you reach scale....

Redundancy

the duplication of critical components of a system with the intention of increasing reliability

example: jet engines

Master - Master

R/W R/W

duplication of critical components....

Master - Master

R/W R/W

Side A Side B

we call these sides “replicants”

Scalability

the ability of a system to handle a growing amount of work in a capable manner

(grocery store example)

shard 1 shard 2 shard N

horizontal scaling

shard N + 1

horizontal scaling

shard N + 1

Migrate Migrate Migrate

horizontal scaling

Bird’s-Eye View

http://www.flickr.com/photos/feuilllu/36612719/sizes/l/in/photostream/

tickets index

3 main components

couple others, dbaux, dbtasks

tickets index

Unique IDs

tickets index

Shard Lookup

tickets index

Store/Retrieve Data

Basics

what is sharding?

users_groups

user_id group_id

users_groups

user_id group_id

creating horizontal partitions from a table

users_groups

user_id group_id

users_groups

user_id group_id

shard 1

shard 2

Index Servers

have to be able to find the data; these simply exist to look up where the data is

to answer the question: what shard is the data on?

http://www.flickr.com/photos/mamsy/4175783446/sizes/l/in/photostream/

want to find details for a user

select shard_id from user_index where user_id = X

first get the shard id, have the PK

select shard_id from user_index where user_id = X

returns 1

select join_date from users where user_id = X

returns 2012-02-05
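The two queries above amount to a two-step lookup: ask the index for the shard, then query that shard by primary key. A minimal sketch follows, with in-memory SQLite databases standing in for the MySQL index server and shards. The function name `find_join_date`, the sample user id 42, and the connection handling are assumptions for illustration, not Etsy's code.

```python
import sqlite3

# Hypothetical stand-ins: one in-memory SQLite database for the index server
# and one per shard. The real system uses separate MySQL servers.
index_db = sqlite3.connect(":memory:")
index_db.execute(
    "CREATE TABLE user_index (user_id INTEGER PRIMARY KEY, shard_id INTEGER)"
)
index_db.execute("INSERT INTO user_index VALUES (42, 1)")

shards = {1: sqlite3.connect(":memory:"), 2: sqlite3.connect(":memory:")}
for db in shards.values():
    db.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, join_date TEXT)")
shards[1].execute("INSERT INTO users VALUES (42, '2012-02-05')")


def find_join_date(user_id):
    """Look up the shard on the index, then query that shard by primary key."""
    row = index_db.execute(
        "SELECT shard_id FROM user_index WHERE user_id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None  # unknown user: there is no shard to ask
    shard = shards[row[0]]
    result = shard.execute(
        "SELECT join_date FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()
    return result[0] if result else None
```

Every read costs one extra index query, which is why the index servers only hold small id-to-shard rows.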

Ticket Servers

http://www.flickr.com/photos/rexroof/5126088323/sizes/l/in/photostream/

Globally Unique ID

can't use auto-increment with a distributed system, so hand out globally unique IDs

CREATE TABLE `tickets` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `stub` char(1) NOT NULL default '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM

only MyISAM tables; leverage the MyISAM engine's table-level locking (no concurrent writes)

REPLACE INTO tickets (stub) VALUES ('a');
SELECT LAST_INSERT_ID();

Ticket Generation

since value ‘a’ exists, it replaces the row with the same value (and bumps the id)

if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted

REPLACE INTO tickets (stub) VALUES ('a');
SELECT LAST_INSERT_ID();

SELECT * FROM tickets;

Ticket Generation

id stub

4589294 a
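The REPLACE trick can be tried locally. The sketch below uses SQLite rather than MySQL, so it only approximates the behavior: SQLite's `AUTOINCREMENT` and `cursor.lastrowid` stand in for MySQL's `auto_increment` and `LAST_INSERT_ID()`, and the helper name `next_ticket` is hypothetical.

```python
import sqlite3

# SQLite stand-in for the MyISAM tickets table (an approximation, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " stub TEXT NOT NULL UNIQUE)"
)


def next_ticket(conn):
    """Replace the single 'a' row; the new row gets a fresh, larger id."""
    cur = conn.execute("REPLACE INTO tickets (stub) VALUES ('a')")
    return cur.lastrowid


first = next_ticket(conn)
second = next_ticket(conn)

# The table never grows: only the most recent row survives.
rows = conn.execute("SELECT id, stub FROM tickets").fetchall()
```

Each call hands out a larger id while the table stays at one row, which matches the single-row `SELECT *` result shown above.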

auto-increment-increment = 2

auto-increment-offset = 1

tickets A

tickets B

ODD: offset=1

EVEN: offset=2

http://openclipart.org/detail/94723/database-symbol-by-rg1024

tickets A

tickets B

NOT master-master

failure is ok, only lose last ticket id

can bring another server up with new offset
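A rough sketch of why the odd/even split tolerates a failure: each server counts in steps of 2 from its own offset, so the two sequences never collide and either server can keep handing out ids alone. The `TicketServer` class and `get_ticket` helper are hypothetical in-memory stand-ins, not the real MySQL setup.

```python
import itertools


class TicketServer:
    """In-memory stand-in for one ticket database (hypothetical)."""

    def __init__(self, offset, increment=2):
        # Mirrors auto-increment-offset / auto-increment-increment.
        self._ids = itertools.count(offset, increment)
        self.up = True

    def next_id(self):
        if not self.up:
            raise ConnectionError("ticket server is down")
        return next(self._ids)


def get_ticket(servers):
    """Try each ticket server in order and return the first id handed out."""
    for server in servers:
        try:
            return server.next_id()
        except ConnectionError:
            continue  # fall through to the other server
    raise RuntimeError("no ticket server available")


tickets_a = TicketServer(offset=1)  # odd ids: 1, 3, 5, ...
tickets_b = TicketServer(offset=2)  # even ids: 2, 4, 6, ...
```

Ids stay unique across servers but are not strictly ordered by time, since one server can run ahead of the other.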

http://openclipart.org/detail/94723/database-symbol-by-rg1024

Shards

shards hold the majority of the data

http://www.flickr.com/photos/merrickb/63999750/sizes/o/in/photostream/

Object Hashing

....aka pinning data to one side of the shard

after we determine the shard we have to determine side A or side B given the replicant index

also helps keep connections to a (relative) minimum since all stuff sharded by a specific instance will then pick the same side

user_id : 500

so we know the shard, now which replicant

object id in this case is user_id

side a/b are replicants

user_id : 500 % (# active replicants)

'etsy_index_A' => 'mysql:host=dbindex01.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw',
'etsy_index_B' => 'mysql:host=dbindex02.ny4.etsy.com;port=3306;dbname=etsy_index;user=etsy_rw',
'etsy_shard_001_A' => 'mysql:host=dbshard01.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
'etsy_shard_001_B' => 'mysql:host=dbshard02.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
'etsy_shard_002_A' => 'mysql:host=dbshard03.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
'etsy_shard_002_B' => 'mysql:host=dbshard04.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
'etsy_shard_003_A' => 'mysql:host=dbshard05.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
'etsy_shard_003_B' => 'mysql:host=dbshard06.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',

each master master pair in the config

user_id : 500 % (2)

user_id : 500 % (2) == 0

user_id : 500 % (2) == 0 select ... insert ...update ...

user_id : 500 % (2) == 0

user_id : 501 % (2) == 1

500: select ... insert ... update ... (side A)

501: select ... insert ... update ... (side B)

Failure

http://www.flickr.com/photos/44124348109@N01/6467405231/

user_id : 500 % (2) == 0

user_id : 501 % (2) == 1

user_id : 500 % (1) == 0

user_id : 501 % (1) == 0
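The modulo rule, including the failure case where only one replicant is left, can be sketched in a few lines. The function name `pick_replicant` and the list-of-sides representation are assumptions for illustration.

```python
def pick_replicant(object_id, active_replicants):
    """Pin an object to one side: object_id modulo the number of active sides.

    active_replicants is an ordered list such as ["A", "B"]; when a side
    fails it is removed from the list and every object lands on the survivor.
    """
    if not active_replicants:
        raise ValueError("no active replicants for this shard")
    return active_replicants[object_id % len(active_replicants)]


# Both sides healthy: even ids go to A, odd ids go to B.
healthy = [pick_replicant(uid, ["A", "B"]) for uid in (500, 501)]

# Side B is pulled from the config: everything maps to index 0, which is A.
degraded = [pick_replicant(uid, ["A"]) for uid in (500, 501)]
```

Because the choice depends only on the object id and the active list, every request for the same user picks the same side without any coordination.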

Variants

variants are mirrors of the same data in different tables

http://www.flickr.com/photos/garibaldi/522196113/sizes/o/in/photostream/

shard 1

user_id group_id

shard 2

user_id group_id

SELECT user_id FROM users_groups WHERE group_id = 'A'

shard 1

user_id group_id

shard 2

user_id group_id

SELECT user_id FROM users_groups WHERE group_id = 'A'

Broken!

shard 1

user_id group_id

shard 2

user_id group_id

SELECT user_id FROM users_groups WHERE group_id = 'A'

Broken!

users_groups: user_id, group_id

groups_users: group_id, user_id

mirror the data, map users to groups, groups to users

index

users_groups_index: user_id, shard_id

groups_users_index: group_id, shard_id

separate indexes for different slices of data

index

users_groups_index: user_id, shard_id

groups_users_index: group_id, shard_id

shard 3: user_id, group_id

look up the groups a user is part of
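A minimal sketch of the variant idea: every membership is written twice, once keyed by user on the user's shard and once keyed by group on the group's shard, so either lookup touches exactly one shard. The dictionaries below are hypothetical in-memory stand-ins for the shards and the two index tables.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two shards and the two index tables.
shards = {
    1: {"users_groups": defaultdict(set), "groups_users": defaultdict(set)},
    2: {"users_groups": defaultdict(set), "groups_users": defaultdict(set)},
}
users_groups_index = {10: 1, 11: 2}  # user_id  -> shard_id
groups_users_index = {"A": 2}        # group_id -> shard_id


def add_membership(user_id, group_id):
    """Write both variants: one on the user's shard, one on the group's."""
    shards[users_groups_index[user_id]]["users_groups"][user_id].add(group_id)
    shards[groups_users_index[group_id]]["groups_users"][group_id].add(user_id)


def groups_for_user(user_id):
    """Answered from users_groups on the single shard that owns the user."""
    shard = shards[users_groups_index[user_id]]
    return sorted(shard["users_groups"].get(user_id, ()))


def users_in_group(group_id):
    """Answered from groups_users on the single shard that owns the group."""
    shard = shards[groups_users_index[group_id]]
    return sorted(shard["groups_users"].get(group_id, ()))
```

The cost is a second write per membership and the need to keep the two copies in step.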

Dev Data

now let's talk about development data

The Problem

hit this a few years ago, every big company probably has this issue

sync prod to dev, until prod data gets too big

http://www.flickr.com/photos/uwwresnet/6280880034/sizes/l/in/photostream/

Some Approaches

subsets of data

generated data

subsets have to end somewhere (a shop has favorites that are connected to people, connected to shops, etc)

generated data can be time consuming to fake

But...

but there is a problem with both of those approaches

Edge Cases

what about testing edge cases, difficult to diagnose bugs?

hard to model the same data set that produced a user facing bug

http://www.flickr.com/photos/kalexanderson/6199793967/sizes/o/in/photostream/

Complexity

another issue is testing problems at scale, complex and large gobs of data

real social network ecosystem can be difficult to generate (favorites, follows) (activity feed, “similar items” search gives better results in prod)

http://www.flickr.com/photos/doug88888/4687906267/sizes/o/in/photostream/

Copy prod data to dev ?

what most people do before data gets too big

almost 3 days to sync 30TB over a 1Gbps link, close to 10 hrs over 10Gbps

bringing the prod dataset to dev was expensive (hardware/maintenance), as was keeping parity with prod, and applying schema changes would take at least as long
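The sync-time figures can be sanity-checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a link running at full line rate with no protocol overhead, so it gives a lower bound. The "close to 10 hrs" figure quoted for 10Gbps presumably allows for real-world throughput below line rate, which is a guess rather than something the slides state.

```python
def transfer_hours(size_bytes, link_bits_per_sec):
    """Hours needed to move size_bytes over a link at full line rate."""
    return size_bytes * 8 / link_bits_per_sec / 3600


THIRTY_TB = 30 * 10**12  # assumes decimal terabytes

hours_1g = transfer_hours(THIRTY_TB, 10**9)     # about 66.7 hours, ~2.8 days
hours_10g = transfer_hours(THIRTY_TB, 10**10)   # about 6.7 hours at line rate
```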

Use Production

(sometimes)

instead....

so we did what we saw as the last resort - used production

not for greenfield development, more for mature features and diagnosing bugs

we still have a dev database but the data is sparse and unreliable

goes without saying this can be dangerous, and people have to be aware they are doing it

http://instagram.com/p/d8nw9aNqlt/

http://www.flickr.com/photos/stuckincustoms/432361985/sizes/l/in/photostream/

dev shard

introducing....

dev shard, shard used for initial writes of data created when coming from dev env

tickets index

DEV shard

www.etsy.com www.goulah.vm

Initial Writes

DEV shard

Initial Writes

writes from etsy.com go everywhere -except- dev shard

DEV shard

Initial Writes

writes from my vm -only- go to dev shard

mysql proxy

proxy hits all of the shards/index/tickets

http://www.oreillynet.com/pub/a/databases/2007/07/12/getting-started-with-mysql-proxy.html

explicitly enabled

% dev_proxy on
Dev-Proxy config is now ON. Use 'dev_proxy off' to turn it off.

Not on all the time

visual notifications

notify engineers they are using the proxy, this is read-only mode

read/write mode

read-write mode, needed for login and other things that write data

% ./bin/myscript
YOU CURRENTLY HAVE THE READ WRITE PROXY TURNED ON AND ARE RUNNING A CLI SCRIPT!!!
You must type the phrase 'read write proxy' and press enter to continue...

known input/output

we know where all of the queries from dev originate from

http://www.flickr.com/photos/medevac71/4875526920/sizes/l/in/photostream/

dangerous/unnecessary queries

(DEV) etsy_rw@jgoulah [test]> select * from fred_test;

ERROR 9001 (E9001): Selects from tables must have where clauses

-- filter dangerous queries - (queries without a WHERE)

-- remove unnecessary queries - (instead of DELETE, have a flag, ALTER statements don't run from dev)
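The filtering rules can be illustrated with a small checker. This is a naive Python sketch of the idea, not the proxy's actual implementation. The names `check_dev_query` and `QueryRejected` are hypothetical, the keyword matching would be fooled by a WHERE inside a string literal or subquery, and treating DELETE the same as ALTER (rejected outright) is an assumption about what "have a flag" means.

```python
import re


class QueryRejected(Exception):
    """Raised when a query from dev is not allowed through (hypothetical)."""


def check_dev_query(sql):
    """Return the statement if it may be forwarded, else raise QueryRejected."""
    statement = sql.strip().rstrip(";").strip()
    verb = statement.split(None, 1)[0].upper() if statement else ""

    # Assumption: ALTER and DELETE are simply never forwarded from dev.
    if verb in ("ALTER", "DELETE"):
        raise QueryRejected(f"{verb} statements are not run from dev")

    # A SELECT that reads from a table must carry a WHERE clause.
    reads_table = re.search(r"\bFROM\b", statement, re.IGNORECASE)
    has_where = re.search(r"\bWHERE\b", statement, re.IGNORECASE)
    if verb == "SELECT" and reads_table and not has_where:
        raise QueryRejected("Selects from tables must have where clauses")

    return statement
```

A production filter would need a real SQL parser; the point here is only the shape of the rule set.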

logging

the basis of anomaly detection is log collection

2013-04-22 18:05:43 485370821 devproxy --

/* DEVPROXY source=10.101.194.19:40198

uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361

[htSp8458VmHlC] [etsy_index_B] [browse.php] */

SELECT id FROM table;

fields in the log line: date, thread id, source ip, unique id generated by proxy, app request id, dest. shard, script
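Because every proxied query carries this comment header, the fields can be pulled out with a regular expression for later analysis. The pattern below is a sketch fitted to the single sample line shown above; the group names are my own labels, and the real log format may have variations this does not handle.

```python
import re

# Pattern fitted to the sample log line above; group names are illustrative.
LOG_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<thread_id>\d+)\s+devproxy\s+--\s+"
    r"/\*\s*DEVPROXY\s+source=(?P<source>\S+)\s+"
    r"uuid=(?P<uuid>\S+)\s+"
    r"\[(?P<request_id>[^\]]*)\]\s+"
    r"\[(?P<shard>[^\]]*)\]\s+"
    r"\[(?P<script>[^\]]*)\]\s*\*/\s*"
    r"(?P<query>.*)",
    re.DOTALL,
)


def parse_devproxy_line(line):
    """Return a dict of the annotated fields, or None if the line differs."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


sample = (
    "2013-04-22 18:05:43 485370821 devproxy -- "
    "/* DEVPROXY source=10.101.194.19:40198 "
    "uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361 "
    "[htSp8458VmHlC] [etsy_index_B] [browse.php] */ "
    "SELECT id FROM table;"
)
fields = parse_devproxy_line(sample)
```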

stealth data

hiding data from users (favorites go on dev and prod shard, making sure test user/shops don’t show up in search)

http://www.flickr.com/photos/davidyuweb/8063097077/sizes/h/in/photostream/

overlays

An overlay is a local copy of production data. If there are overlays in place in dev, it will send the queries to the local db instead (it does this by overriding the shard lookup on the index and checking for a table/pk pair).

user_id group_id

store in memcache: <table, pk>

Any time we write to the other shards from dev, the shard migration copies the to-be-affected rows to their local MySQL instance over the dev proxy and then stores the table/pk pair for subsequent lookup
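The overlay routing decision can be sketched as a lookup that runs before the normal index lookup. Here a Python set stands in for memcache, and `record_overlay`, `route_query`, and the `"local"` return value are hypothetical names chosen for illustration.

```python
# Stand-in for memcache: the set of (table, pk) pairs copied to the local db.
overlay_keys = set()


def record_overlay(table, pk):
    """Remember that this row now lives in the developer's local database."""
    overlay_keys.add((table, pk))


def route_query(table, pk, lookup_shard):
    """Send the query to the local db if an overlay exists, else to its shard.

    lookup_shard is the normal index lookup (pk -> shard id).
    """
    if (table, pk) in overlay_keys:
        return "local"
    return lookup_shard(pk)
```

For example, after `record_overlay("users_groups", 500)`, a query for that row is routed to `"local"`, while any other row still goes through the index lookup.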

Delayed Slaves

pt-slave-delay watches a slave and starts and stops its replication SQL thread as necessary to hold it at least as far behind the master as you request

http://www.flickr.com/photos/xploded/141295823/sizes/o/in/photostream/

4 hour delay behind master

produce row based binary logs

Delayed Slaves

allow for quick recovery

role of the delayed slave

also source of BCP (business continuity planning - prevention and recovery of threats)

pt-slave-delay --daemonize \
  --pid /var/run/pt-slave-delay.pid --log /var/log/pt-slave-delay.log \
  --delay 4h --interval 1m --nocontinue

last 3 options are the most important: 4h delay; interval is how frequently it should check whether the slave should be started or stopped; nocontinue means don't continue replication normally on exit (don't catch up with master)

user/pass omitted for brevity

R/W R/W

Shard Pair

pt-slave-delay

row based binlogs

R/W R/W

Shard Pair

Parse/Transform

Vertica

in addition can use slaves to send data to other stores for offline queries

1) parse each binlog file to generate sequence file of row changes

2) apply the row changes to a previous set for the latest version
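Step 2 is a fold of row changes over the previous snapshot. The sketch below assumes each parsed change is an `(operation, primary_key, row)` tuple, which is an invented format for illustration; the slides do not describe the actual sequence file layout.

```python
def apply_row_changes(snapshot, changes):
    """Apply ordered row changes to a previous snapshot; return the new one.

    snapshot maps primary key -> row dict. Each change is a hypothetical
    (operation, primary_key, row) tuple parsed from a row-based binlog.
    """
    latest = dict(snapshot)  # leave the previous version untouched
    for operation, pk, row in changes:
        if operation in ("insert", "update"):
            latest[pk] = row
        elif operation == "delete":
            latest.pop(pk, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return latest


previous = {1: {"shop_id": 1, "sales": 10}, 2: {"shop_id": 2, "sales": 3}}
changes = [
    ("update", 1, {"shop_id": 1, "sales": 11}),
    ("delete", 2, None),
    ("insert", 3, {"shop_id": 3, "sales": 0}),
]
current = apply_row_changes(previous, changes)
```

Row-based binlogs matter here because they carry the changed row values, so no SQL has to be re-executed to rebuild the latest state.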

Schema Changes

alters take forever, lock rows being altered (this is why we have new things like online schema change)

LOTS of servers to apply changes to PLUS the alter problem

apply to a side that is inactive

Schemanator

!! explain the config push process a bit

also this is used to apply the alters

SET SQL_LOG_BIN = 0; ALTER TABLE user ....

check two things in test phase:

- schema applies to blank db

- table validates against our sql standards

shard migration

migration of data from one shard to another

why migrate data?

Prevent disk from filling

Prevent disk from filling

High traffic objects (shops, users)

high traffic == disk usage and I/O util

Prevent disk from filling

High traffic objects (shops, users)

Shard rebalancing

rebalancing when adding new shards or shards fill unequally

users per shard

Balance

how many users on each shard

# migrate_object User 5307827 2

per object migration <object type> <object id> <shard>

# migrate_pct User 25 3 6

percentage migration <object type> <percent> <old shard> <new shard>

user_id shard_id migration_lock old_shard_id

1 1 0 0

1 1 1 0

• Lock

explain about the lock, what happens in app, reads vs. writes

1 1 1 0

• Lock • Migrate

1 1 1 0

• Lock • Migrate • Checksum

checksum is a count(*) on each table

1 1 1 0

• Lock • Migrate • Checksum

1 2 0 1

• Lock • Migrate • Checksum • Unlock

1 2 0 1

• Lock • Migrate • Checksum • Unlock • Delete (from old shard)

deletes are out of band, auto-back off by looking at connection metrics
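The five steps and the index row transitions (1 1 0 0, then 1 1 1 0, then 1 2 0 1) can be traced in a small sketch. Everything here is a hypothetical in-memory model: shards are nested dicts, the checksum is a row count per table as described, and the delete runs inline for brevity even though the real deletes happen out of band.

```python
def migrate_user(index_row, shards, new_shard_id):
    """Lock, migrate, checksum, unlock, delete for one user (in-memory model).

    index_row: dict with user_id, shard_id, migration_lock, old_shard_id.
    shards: {shard_id: {table_name: {user_id: [rows]}}}.
    """
    user_id = index_row["user_id"]
    old_shard_id = index_row["shard_id"]
    source, target = shards[old_shard_id], shards[new_shard_id]

    index_row["migration_lock"] = 1                          # Lock
    try:
        for table, rows in source.items():                   # Migrate
            copied = list(rows.get(user_id, []))
            target.setdefault(table, {})[user_id] = copied
        for table, rows in source.items():                   # Checksum (counts)
            if len(target[table][user_id]) != len(rows.get(user_id, [])):
                raise RuntimeError(f"checksum mismatch on {table}")
        index_row["shard_id"] = new_shard_id                 # repoint the index
        index_row["old_shard_id"] = old_shard_id
    finally:
        index_row["migration_lock"] = 0                      # Unlock

    # Delete (from old shard). Inline here; out of band in the real system.
    for rows in source.values():
        rows.pop(user_id, None)
```

If the checksum fails, the exception propagates after the unlock, so the index still points at the old shard and nothing is deleted.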

Logical Shards

Writing data into the new shard, deleting data from the old shard and then optimizing every single table is a large amount of work

Instead can run a mysql process with many databases

dbshard38 (mysql): db_300, db_301, db_302, db_303, db_304, db_305

with this, slave replication is multiplied by the number of logical shards per box (assuming even distribution of writes)

'etsy_shard_001_A' => 'mysql:host=dbshard01.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
'etsy_shard_001_B' => 'mysql:host=dbshard02.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
'etsy_shard_002_A' => 'mysql:host=dbshard03.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',
'etsy_shard_002_B' => 'mysql:host=dbshard04.ny4.etsy.com;port=3306;dbname=etsy_shard;user=etsy_rw',

'etsy_shard_100_A' => 'mysql:host=dbshard50.ny4.etsy.com;port=3306;dbname=etsy_shard_100;user=etsy_rw',
'etsy_shard_100_B' => 'mysql:host=dbshard51.ny4.etsy.com;port=3306;dbname=etsy_shard_100;user=etsy_rw',
'etsy_shard_101_A' => 'mysql:host=dbshard50.ny4.etsy.com;port=3306;dbname=etsy_shard_101;user=etsy_rw',
'etsy_shard_101_B' => 'mysql:host=dbshard51.ny4.etsy.com;port=3306;dbname=etsy_shard_101;user=etsy_rw',

same mysql instance different database/schema
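The difference from the physical-shard config is that the database name now carries the logical shard number while several logical shards share a host. A small sketch that generates entries in the same key and DSN format as the config above; the helper name `shard_dsn` is hypothetical, and the output is a Python dict rather than the PHP array shown.

```python
def shard_dsn(shard_number, side, host, user="etsy_rw", port=3306):
    """Build one config entry for a logical shard living on a shared host."""
    dbname = f"etsy_shard_{shard_number:03d}"
    key = f"{dbname}_{side}"
    dsn = f"mysql:host={host};port={port};dbname={dbname};user={user}"
    return key, dsn


# Logical shards 100 and 101 both live on the dbshard50/dbshard51 pair.
hosts = {"A": "dbshard50.ny4.etsy.com", "B": "dbshard51.ny4.etsy.com"}
config = dict(
    shard_dsn(number, side, host)
    for number in (100, 101)
    for side, host in hosts.items()
)
```

Moving a logical shard to another box then only means changing the host part of its two entries.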

Advantages

• multi threaded slave

• simpler migrations

In MySQL 5.6 we have multi-threaded slave but it can only do parallel processing if we have multiple MySQL schemas (databases).

The con is that we have many more logical shards to maintain

Logical Shard Migrations

Let's walk through a logical shard migration...

dbshard41: db_300, db_301, ...., db_312

dbshard42: db_300, db_301, ...., db_312

dbshard61: (empty)

dbshard62: (empty)

Suppose dbshard 41/42 have shard dbs 300 - 312 and we want to move half of them to a new shard pair (61/62)

dbshard41: db_300, db_301, ...., db_312

dbshard42: db_300, db_301, ...., db_312

restore backup

dbshard61: db_300, db_301, ...., db_312

dbshard62: db_300, db_301, ...., db_312

We restore last night's backup from 41 onto 61 and 62

dbshard41: db_300, db_301, ...., db_312

dbshard42: db_300, db_301, ...., db_312

dbshard61: db_300, db_301, ...., db_312

dbshard62: db_300, db_301, ...., db_312

Set up 62 to slave from 61, and 61 to slave from 41 starting from where the backup stopped.

dbshard41: db_300, db_301, ...., db_312

dbshard42: db_300, db_301, ...., db_312

dbshard61: db_300, db_301, ...., db_312

dbshard62: db_300, db_301, ...., db_312

Once 61 and 62 are all caught up, change the config such that dbshard42 is disabled, and all writes/reads go to dbshard41

dbshard41: db_300, db_301, ...., db_312

dbshard42: db_300, db_301, ...., db_312

dbshard61: db_307, ...., db_312

dbshard62: db_300, db_301, ...., db_312

config: db_307-312 change from dbshard41 to dbshard61

Then change the config for db_307 through 312 on dbshard41 to point to dbshard61.

dbshard41: db_300, db_301, ...., db_312

dbshard42: db_300, db_301, ...., db_312

dbshard61: db_307, ...., db_312

dbshard62: db_300, db_301, ...., db_312

Reset dbshard61 slave to point to dbshard62 instead. So now we have Master-Master going.

dbshard41: db_300, db_301, ...., db_312

dbshard42: db_300, db_301, ...., db_312

dbshard61: db_307, ...., db_312

dbshard62: db_307, ...., db_312

slave

config: db_307-312 change from dbshard42 to dbshard62

Change db_307 through 312 on dbshard42 in the config to point to dbshard62.

dbshard41: db_300, db_301, ...., db_306

dbshard42: db_300, db_301, ...., db_306

dbshard61: db_307, ...., db_312

dbshard62: db_307, ...., db_312

And we're done. Drop db_307 through db_312 on dbshard41/42, re-enable writes on 42
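From the application's point of view, the whole walk-through comes down to two config changes: repoint db_307 through db_312 on one side from dbshard41 to dbshard61, then on the other side from dbshard42 to dbshard62. A sketch under assumptions: the `(database, side)` keys, the `repoint` helper, and calling dbshard41 "side A" are all illustrative choices, not the real config layout.

```python
def repoint(config, databases, side, new_host):
    """Return a copy of the config with the given logical dbs moved to new_host."""
    updated = dict(config)
    for database in databases:
        updated[(database, side)] = new_host
    return updated


# Starting point: db_300..db_312 on the dbshard41/dbshard42 pair.
config = {
    (f"db_{n}", side): host
    for n in range(300, 313)
    for side, host in (("A", "dbshard41"), ("B", "dbshard42"))
}
moving = [f"db_{n}" for n in range(307, 313)]

config = repoint(config, moving, "A", "dbshard61")  # first config change
config = repoint(config, moving, "B", "dbshard62")  # second config change
```

No rows are copied by the application during the move; the data arrives through the restored backup and replication, and the config change just switches which host answers.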

Other Tools

mysqlsummary

essentially just reformatting show processlist

% mysqlsummary.pl --host dbshard31

Details for dbshard31
==================================

COMMAND SUMMARY
===============
 Sleep 211 (96.79%)
 Execute 2 (0.92%)
 Connect 2 (0.92%)
 Binlog Dump 2 (0.92%)
 Query 1 (0.46%)

HOST SUMMARY
============
 meteor03 10 (4.59%)
 meteor01 8 (3.67%)
 web0228 3 (1.38%)
 api05 3 (1.38%)
 worker05 3 (1.38%)
 worker12 3 (1.38%)

SCRIPT SUMMARY
==============
 Job: ShopStats/calculate 1 (0.46%)
 Job: NewsFeed/refresh 1 (0.46%)

SQL SUMMARY
===========
 select 1 (0.46%)
 SELECT 1 (0.46%)
 SHOW 1 (0.46%)

COMMAND TIMINGS
===============
----------------------------------------------------------------------
+ HOST: worker19, USER: , DB: 2, TIME: 4
----------------------------------------------------------------------
select * from activity where owner_id = 7395036 and owner_type_id = 2 and deleted = 0 and creation_time >= 1382226430 and public = 1 order by creation_time desc limit 0,50

----------------------------------------------------------------------
+ HOST: worker27, USER: , DB: 2, TIME: 4
----------------------------------------------------------------------
SELECT * FROM shop_stats WHERE shop_id = 5902046 AND currency_code = 'USD' AND sales_year = 2012 AND id != 2432609442

ORM REPL

% php-repl
[1] etsy-php> EtsyORM::getFinder('User');

→ object(EtsyModel_UserFinder)(
  0 => 'countAll( SELECT count(*) FROM User )',
  1 => 'findByLoginName ( $login_name )',
  2 => 'findByEmail ( $primary_email )',
...

we send queries over UDP from our ORM and stick them in a db to analyze later

request context: request id, logged-in user id, what script is executing

avoids the perf hit of the slow query log, and it's realtime across all shards because it originates from the client
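The fire-and-forget idea can be sketched with plain UDP sockets. The JSON payload, the field names, and the `send_query_event` helper are assumptions for illustration; the slides do not describe the real wire format or collector. The demo sends one event to a receiver on the loopback interface.

```python
import json
import socket


def send_query_event(sock, address, sql, request_id, user_id, script):
    """Send one query plus its request context as a single UDP datagram."""
    payload = json.dumps(
        {"sql": sql, "request_id": request_id, "user_id": user_id, "script": script}
    ).encode("utf-8")
    # UDP is fire and forget: no connection and no waiting on the collector.
    sock.sendto(payload, address)


# Loopback demo: a local socket stands in for the collector.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))  # let the OS pick a free port
receiver.settimeout(2.0)

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_query_event(
    sender,
    receiver.getsockname(),
    sql="SELECT join_date FROM users WHERE user_id = 500",
    request_id="htSp8458VmHlC",
    user_id=500,
    script="browse.php",
)

data, _ = receiver.recvfrom(65535)
event = json.loads(data)

sender.close()
receiver.close()
```

A dropped datagram just means one query missing from the analysis, which is the trade-off that keeps the logging off the request's critical path.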

Thank you

etsy.com/jobs
