
Methods of Sharding MySQL – Percona Live NYC 2012

Who are Palomino?
➢ Bespoke Services: we work with and like you.
➢ Production Experienced: senior DBAs, admins, and engineers.
➢ 24x7: globally-distributed on-call staff.
➢ Short-term no-lock-in contracts.
➢ Professional Services (DevOps):
  ➢ Chef,
  ➢ Puppet,
  ➢ Ansible.
➢ Big Data Cluster Administration (OpsDev):
  ➢ MySQL, PostgreSQL,
  ➢ Cassandra, HBase,
  ➢ MongoDB, Couchbase.

Methods of Sharding MySQL – Percona Live NYC 2012

Who am I?
Tim Ellis, CTO/Principal Architect, Palomino

Achievements:
➢ Palomino Big Data Strategy.
➢ Datawarehouse Cluster at Riot Games.
➢ Back-end Storage Architecture for Firefox Sync.
➢ Led DB teams at Digg for four years.
➢ Harassed the Reddit team at one of their parties.

Ensured Successful Business for:
➢ Digg, Friendster,
➢ Riot Games,
➢ Mozilla,
➢ StumbleUpon.

Methods of Sharding MySQL – What is this Talk?

Large cluster admin: when one DB isn't enough.
➢ What is a shard?
➢ What shard types can I choose?
➢ How to build a large DB cluster.
➢ How to administer that giant mess of DBs.

Types of large clusters:
➢ Just a bunch of databases.
➢ Distributed database across machines.

Methods of Sharding MySQL – Where the Focus will Lie

12% – Sharding theory/considerations.

25% – Building a Cluster to administer (tutorial):
➢ Palomino Cluster Tool.

50% – Flexible large-cluster administration (tutorial):
➢ Tumblr's Jetpants.

13% – Other sharding technologies (talk-only):
➢ Youtube's Vtocc (Vitess),
➢ Twitter's Gizzard,
➢ HAproxy.

Methods of Sharding MySQL – What about the Silver Bullets?

NoSQL Distributed Databases:
➢ Promise “sharding” for free,
➢ Promise uptime and horizontal scaling, trivially.

Reality:
➢ RDBMS is 40-yr-old tech,
➢ NoSQL is 10-yr-old tech,
➢ Which is responsible for how many high-profile downtimes in the past 10 years?
➢ Evaluate the alternatives without illusions.

Methods of Sharding MySQL – What is a Shard?

A location for a subset of data:
➢ Itself made of pieces.
➢ Typically itself redundant.

[Diagram: three example shards – a Shard for User Data, a Shard for Logging Data, and a Shard for Posts Data – each consisting of a Master replicating to two Slaves.]

Methods of Sharding MySQL – What are the Sharding Method Choices?

By-Function:
➢ Move busy tables onto new shard.
➢ Writes of busiest tables on new hardware.
➢ Writes of remaining tables on current.

By-Columns:
➢ Split table into chunks of related columns,
➢ Store each set on its own Master/Slaves shard.

By-Rows:
➢ A table is split into N shards; each shard gets a subset of the rows of the table.

Methods of Sharding MySQL – Shard Method Choices

By-Function and By-Column Methods:
➢ Much easier.
➢ Can get you through months to years.
➢ Eventually you run out of options here.

By-Row Method:
➢ The hardest to do.
➢ Requires new ways of accessing data.
➢ Often requires sophisticated cache strategies.
➢ Itself can be done several ways.

Methods of Sharding MySQL – By-Function Sharding

Picking a Functional Split:
➢ A subset of tables commonly joined.
➢ Tables outside this subset nearly never joined.
➢ One of them responsible for many writes.

Every JOIN to a table outside the subset must be rewritten as code-based multi-SELECTs.

Once the subset of tables is moved onto its own server, writes are distributed.
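
For illustration only – a minimal sketch with hypothetical users and payments tables and plain DB-API cursors, one per functional shard – a former JOIN becomes two SELECTs stitched together in application code:

def user_with_payments(users_cur, payments_cur, user_id):
    # One query per functional shard instead of a single cross-shard JOIN.
    users_cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
    user = users_cur.fetchone()
    payments_cur.execute("SELECT amount, paid_at FROM payments WHERE user_id = %s", (user_id,))
    return {"user": user, "payments": payments_cur.fetchall()}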

Methods of Sharding MySQL – By-Column Sharding (Vertical Partition)

Identifying a candidate table:
➢ Many columns (“users” anyone?),
➢ Many updates,
➢ Many indexes.

Required: even split of columns/indexes by update frequency. Attempt: logical grouping.

JOINs are neither possible nor desirable: write multi-SELECT code in the application DAL.
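
Again purely illustrative – assuming a hypothetical users table split into users_core (hot, frequently-updated columns) and users_profile (wide, rarely-touched columns) living on separate shards – the DAL re-assembles a row from two SELECTs:

def load_user(core_cur, profile_cur, user_id):
    # Hot, frequently-updated columns come from one shard...
    core_cur.execute("SELECT email, last_login FROM users_core WHERE id = %s", (user_id,))
    user = dict(zip(("email", "last_login"), core_cur.fetchone()))
    # ...wide, rarely-touched columns come from the other.
    profile_cur.execute("SELECT bio, avatar_url FROM users_profile WHERE id = %s", (user_id,))
    user.update(zip(("bio", "avatar_url"), profile_cur.fetchone()))
    return user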

Methods of Sharding MySQL – Row-based Sharding Choices

Range-based Sharding:
➢ Easy to understand.
➢ Each shard gets a range of rows.
➢ Oft-times some shards are “hot.”
➢ Hot shards are split into separate shards.
➢ Cold shards are joined into a single shard.
➢ Juggling shard load is a frequent process.

Typically the best solution. Shortcomings have known work-arounds.
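
A minimal sketch of range-based routing, with made-up shard hostnames and boundaries; in practice the ranges live in configuration and are re-juggled as shards run hot or cold:

# (low_id, high_id, shard_host) -- boundaries get re-balanced over time.
RANGES = [
    (1, 300000, "shard-a.example.com"),
    (300001, 999999999, "shard-b.example.com"),
]

def shard_for(row_id):
    for low, high, host in RANGES:
        if low <= row_id <= high:
            return host
    raise KeyError("no shard covers id %d" % row_id)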

Methods of Sharding MySQL – Row-based Sharding Choices

Modulus/Hash-based Sharding:
➢ The row key is hashed to an integer modulo the number of shards, then the row is placed on that shard.
➢ Only rarely are some shards “hot.”
➢ Shard splitting is difficult to implement.

Also a common method of sharding. We hope not to split shards often (or ever).

When we do, it's a multi-week process.
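
A minimal sketch of modulus routing with made-up shard hostnames; it also shows why splits hurt: adding a shard changes the modulus and remaps most keys.

import hashlib

SHARDS = ["shard-a.example.com", "shard-b.example.com", "shard-c.example.com"]

def shard_for(row_key):
    # Hash the row key, then reduce it modulo the shard count.
    digest = hashlib.md5(str(row_key).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]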

Methods of Sharding MySQL – Row-based Sharding Choices

Lookup Table-based Sharding:
➢ Easy to understand.
➢ Row key mapped to shard in a lookup table.
➢ Easy to move load off hot shards.
➢ Lookup table method is problematic:
  ➢ Single point of failure.
  ➢ Performance bottleneck.
  ➢ Billions of rows; itself may need sharding.
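
A minimal sketch, assuming a hypothetical shard_map table (row_key mapped to shard_id) living on its own lookup database – exactly the component the bullets above flag as a single point of failure and bottleneck:

def shard_for(lookup_cur, row_key):
    # An extra round-trip to the lookup database on every request,
    # unless the mapping is cached aggressively.
    lookup_cur.execute("SELECT shard_id FROM shard_map WHERE row_key = %s", (row_key,))
    row = lookup_cur.fetchone()
    if row is None:
        raise KeyError("row key %r is not mapped to a shard" % (row_key,))
    return row[0]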

Prerequisite: Build a Large Cluster – Allocating the Hardware

Getting Hardware – your own company's:
➢ Can be politically-charged.
➢ Get a small batch first.
➢ Build small demonstration cluster.
➢ Get everyone on-board with the demo.

Renting/Leasing Hardware – the Cloud:
➢ Allocate hardware in EC2 or elsewhere.
➢ Usually easier, but possibly harder admin:
  ➢ Hardware failure more common.
  ➢ Hardware/network flakiness more common.

Prerequisite: Build a Large Cluster – Building the Cluster

Okay, I've got the hardware. What next?

Prerequisite: Build a Large Cluster – Building the Cluster

Configuring the Hardware. The old dilemma:
➢ Spend days to install/configure DB software? Subsequent management is painful.
➢ Use SSH in “for” loops? Rolling your own configuration management tools is a lot of work.
➢ Learn a configuration management tool? Obvious choice in 2012. Well-documented tools like Chef, Puppet, Ansible.

Configuration Management Tools – My Experience

Puppet: 6 years ago at Digg
➢ Managed/deployed hundreds of servers.
➢ Painful, but not as bad as hand-coding it all.

Chef: 2 years ago at Drawn to Scale and Riot
➢ Managed/deployed dozens of servers.
➢ Learning Ruby is a “joy” of its own.

Ansible: 6 months ago at Palomino
➢ Managed/deployed dozens of servers.
➢ First Palomino Cluster Tool subset built.

Prerequisite: Build a Large Cluster – Configuration Management Options

Pick your Configuration Management:
➢ Chef: Popular, use Ruby to “code your infrastructure.” Must learn Ruby.
➢ Puppet: Mature, use data structures to “define your infrastructure.” Less coding.
➢ Ansible: Tiny and modular, similar to Puppet, but with ordering for deployment. Pragmatic.

Write/Get Recipes, Manifests, Playbooks?
➢ Writing is tedious. Can take >1 week.
➢ Get from internet? Often incomplete.

Prerequisite: Build a Large Cluster – The Palomino Cluster Tool

Palomino's tool for building large DB clusters:
➢ Chef, Puppet, Ansible modules.
➢ Open-source on Github:
  ➢ https://github.com/time-palominodb/PalominoClusterTool
  ➢ Google: “Palomino Cluster Tool.”
➢ Will build a large cluster for you in hours:
  ➢ Master(s),
  ➢ Slaves – hundreds of them as easy as two,
  ➢ MHA – when master fails, a slave takes over.
➢ Previously this would take days.

The Palomino Cluster Tool – Building the Management Node

Cluster Management Node:
➢ Will build the initial cluster.
➢ Will do subsequent cluster management.

Tool for Initial Cluster Build:
➢ Palomino Cluster Tool (Ansible subset).

Tool for Cluster Management:
➢ Jetpants (Ruby).

The Palomino Cluster Tool – Building the Management Node

Palomino Cluster Tool (Ansible subset).

Why Ansible?
➢ No server to set up, simply uses SSH.
➢ Easy-to-understand non-code Playbooks (see the sketch below).
➢ Use a language you know for modules.
➢ For demo purposes, obvious choice.
➢ Also production-worthy:
  ➢ Built by Michael DeHaan, long-time configuration management guru.
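
To give a flavor of a Playbook – a minimal hand-written sketch against a hypothetical [mysqlslaves] inventory group, using current Ansible module syntax rather than anything shipped in the Palomino Cluster Tool – installing and starting MySQL reads roughly like:

- hosts: mysqlslaves
  become: true
  tasks:
    - name: Install the MySQL server package
      apt:
        name: mysql-server
        state: present
    - name: Ensure the MySQL service is running
      service:
        name: mysql
        state: started

Tasks run in the order written, which is the “ordering for deployment” advantage mentioned above.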

The Palomino Cluster Tool – Building the Management Node

Management node lives alongside your cluster.
➢ We are building our cluster in EC2.
➢ Thus management node in EC2.
➢ This tutorial assumes Ubuntu 12.04.
➢ t1.micro is fine for management node.

Install basic tools:
➢ apt-get install git (for Ansible/P.C.T.)
➢ apt-get install make python-jinja2 (for Ansible)

The Palomino Cluster Tool – Configuring the Management Node

Install Ansible:
➢ git clone git://github.com/ansible/ansible.git
➢ make install

Install Palomino Cluster Tool:
➢ git clone git://github.com/time-palominodb/PalominoClusterTool.git

I think we just finished the management node!

The Palomino Cluster Tool – Allocating Shard Nodes

Shard nodes:
➢ m1.small or larger: at least 1.6GB RAM,
➢ :3306, :80, and :22 open between all (one security group in EC2),
➢ Ubuntu 12.04 (other Debian-alikes at your own risk – but may work!).

Do not need OS/database configuration:
➢ Ansible will configure them.

The Palomino Cluster Tool – Building the First Shard – Step 1

From README: edit IP addresses in cluster layout file (PalominoClusterToolLayout.ini):

# Alerting/Trending -----
[alertmaster]
10.252.157.110

[trendmaster]
10.252.157.110

# Servers -----
[mhamanager]
10.252.157.110

This section identical for all Shards.

The Palomino Cluster Tool – Building the First Shard – Step 2

From README: edit IP addresses in cluster layout file (PalominoClusterToolLayout.ini):

[mysqlmasters]
10.244.17.6

[mysqlslaves]
10.244.26.199
10.244.18.178

[mysqls:vars]
master_host=10.244.17.6

This section different for every Shard.

The Palomino Cluster Tool – Building the First Shard – Step 3

Run setup command to put configuration and SSH keys into /etc:

$ cd PalominoClusterTool/AnsiblePlaybooks/Ubuntu-12.04
$ ./00-Setup_PalominoClusterTool.sh ShardA

Run build command – it's a wrapper around Ansible Playbooks:

$ ./10-MySQL_MHA_Manager.sh ShardA

The Palomino Cluster Tool – Building the Second Shard

For this tutorial, just make one shard with a master and many slaves. In real life, you might do something like this instead:

for i in ShardB ShardC ShardD ; do
    # (manual step):
    vim PalominoClusterToolLayout.ini
    # (scriptable steps):
    ./00-Setup_PalominoClusterTool.sh $i
    ./10-MySQL_MHA_Manager.sh $i
done

Run them in separate terminals to save time.

Make the Cluster Real – Data makes Shard Split Interesting

Fill ShardA using random data script.*

Palomino Cluster Tool includes such a tool:
➢ HelperScripts/makeGiantDatafile.pl

$ ssh root@sharda-master
# cd PalominoClusterTool/HelperScripts
# mysql -e 'create database palomino'
# ./makeGiantDatafile.pl 1200000 3 | mysql -f palomino

Install Jetpants, do shard split now.
* Be sure /var/lib/mysql is on a large partition!

Administering the Cluster – Install Jetpants

General idea: Install Ruby >=1.9.2 and RubyGems, then Jetpants via RubyGems.

On my systems, /etc/alternatives is always incorrect, so ln the proper binaries for Jetpants.

# apt-get install ruby1.9.3 rubygems libmysqlclient-dev
# ln -sf /usr/bin/ruby1.9.3 /etc/alternatives/ruby
# ln -sf /usr/bin/gem1.9.3 /etc/alternatives/gem
# gem install jetpants

Administering the Cluster – Configure Jetpants

General idea: edit /etc/jetpants.yaml, then create the Jetpants inventory and application configuration files and chown them to the Jetpants user:

# vim /etc/jetpants.yaml
# mkdir -p /var/jetpants
# touch /var/jetpants/assets.json
# chown jetpantsusr: /var/jetpants/assets.json
# mkdir -p /var/www
# touch /var/www/databases.yaml
# chown jetpantsusr: /var/www/databases.yaml

Administering the Cluster – Jetpants Shard Splits

Tell Jetpants Console about your ShardA:

Jetpants> s = Shard.new(1, 999999999, '10.12.34.56', :ready)  # 10.12.34.56 == ShardA master
Jetpants> s.sync_configuration

Create spares within Console for all others (improved workflow in Jetpants 0.7.8):

Jetpants> topology.tracker.spares << '10.23.45.67'
Jetpants> topology.tracker.spares << '10.23.45.68'
Jetpants> topology.tracker.spares << '10.23.45.69'
Jetpants> topology.write_config
Jetpants> topology.update_tracker_data

Administering the Cluster – Jetpants Shard Splits

Just for this tutorial:
➢ Create the “palomino” database,
➢ Break the replication on all the spares,
➢ Be sure spares are read/write:
  ➢ Edit my.cnf,
  ➢ service mysql restart
➢ Ensure “jetpants pools” output is proper:
  ➢ One master,
  ➢ Two slaves.

Administering the Cluster – Jetpants Shard Splits

How to perform an actual Shard Split:

$ jetpants shard_split --min-id=1 --max-id=999999999

Notes:
➢ Process takes hours. Use screen or nohup.
➢ LeftID == parent's first, RightID == parent's last; no overlap/gap.
➢ Make children 1-300000, 300001-999999999.

Jetpants Shard Splitting – The Gory Details

After “jetpants shard_split”:

ubuntu@ip-10-252-157-110:~$ jetpants pools
shard-1-999999999 [3GB]
        master          = 10.244.136.107  ip-10-244-136-107
        standby slave 1 = 10.244.143.195  ip-10-244-143-195
        standby slave 2 = 10.244.31.91    ip-10-244-31-91
shard-1-400000 (state: replicating) [2GB]
        master          = 10.244.144.183  ip-10-244-144-183
shard-400001-999999999 (state: replicating) [1GB]
        master          = 10.244.146.27   ip-10-244-146-27

    0 global pools
    3 shard pools
    --------------
    3 total pools

    3 masters
    0 active slaves
    2 standby slaves
    0 backup slaves
    --------------
    5 total nodes

Jetpants Improvements – The Result of an Experiment

Jetpants only well-tested on RHEL/CentOS.

Palomino Cluster Tool only well-tested to build Ubuntu 12.04 clusters.

Little effort to fix Jetpants:
➢ /sbin/service location different,
➢ service mysql status output different.

Jetpants Improvements – The Result of an Experiment

Jetpants only well-tested on MySQL 5.1.

I built a cluster of MySQL 5.5.

A little more effort to fix Jetpants:
➢ Setting master_host=' ' is a syntax error,
➢ reset slave needs the keyword “all” appended.

Jetpants Improvements – The Result of an Experiment

Jetpants only well-tested on large datasets.

I built a cluster with only hundreds of MB.

A wee tad more effort to fix Jetpants:
➢ Some timings assumed large datasets,
➢ Edge cases for small/quick operations reported back to the author.

Jetpants Improvements – OSS Collaboration and Win

Evan Elias implemented these fixes last week!
➢ jetpants add_pool,
➢ jetpants add_shard,
➢ jetpants add_spare (with sanity-check of spare),
➢ Shards with 1 slave (not for prod!),
➢ read_only spares not fatal,
➢ Debian-alike (Ubuntu) fixes,
➢ MySQL 5.5 fixes,
➢ Mid-split “jetpants pools” output simpler.

Really responsive ownership of project!

Twitter's Gizzard – What is it?

General Framework for distributed database.
➢ Hides sharding from you.
➢ Literally, it is middleware:
  ➢ Applications connect to Gizzard,
  ➢ Gizzard sends connections to the proper place,
  ➢ Shard splits and hardware failure taken care of.
➢ Created at Twitter by rogue cowboys.
➢ Not completely production-ready.
  ➢ Better than rolling your own!

Twitter's Gizzard – Why should I use it?

You've settled on a row-based partition scheme:
➢ Master nearing I/O capacity, won't scale up,
➢ Can't move some tables to their own pool,
➢ Can't split the columns/indexes out,
➢ You want to keep using the DBMS you already know and love: Percona Server,*
➢ Don't want to think about fault-tolerance or shard splits (much).

* Actually use any storage back-end.

Twitter's Gizzard – The Fine Print

This sounds perfect. Why not Gizzard?

Writes must follow a strict diet. Must be:
➢ Idempotent*,
➢ Commutative**,
➢ Must not have tuberculosis.

* Pfizer cannot remove the idempotency requirement of Gizzard.
** Even on evenings and weekends.

Twitter's Gizzard – Expanding the Fine Print

Idempotency:
➢ Submit a write. Again. And again.
➢ Must be identical to doing it once.
➢ Bad: “update set col = col + 1”

Commutative – writes in arbitrary order:
➢ WriteA→WriteB→WriteC on Node1.
➢ WriteB→WriteC→WriteA on Node2.
➢ Bad: “update set col1 = 42”→“update set col2 = col1 + 5”

Twitter's Gizzard – Expanding the Fine Print

Cluster is Eventually Consistent:
➢ May return old values for reads.
➢ Unknown when consistency will occur.

Like a politician's position on the budget:
➢ Might be consistent in the future.
➢ Just not right now.
➢ Or now.

Twitter's Gizzard – Working Around the Shortcomings

Gizzard work-around:
➢ Add a timestamp to every transaction.
➢ Good:
  ➢ “col1.ts=1; update set col1=42” →
  ➢ “update set col2=col1 + 5 where col1.ts=1”
➢ Implementation trickier if DBMS doesn't support column attributes.

Cannot escape: must radically re-think schema and application/DBMS interaction.
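
To make the idea concrete – a hand-rolled illustration of a timestamp-guarded, last-write-wins update against a hypothetical settings table with a value_ts column, not Gizzard's actual implementation – a write can be made both idempotent and commutative like this:

def apply_write(cur, user_id, new_value, ts):
    # Newest timestamp wins regardless of how many times, or in what
    # order, replicas apply the write: replays are no-ops, and an older
    # write arriving after a newer one changes nothing.
    cur.execute(
        "UPDATE settings SET value = %s, value_ts = %s"
        " WHERE user_id = %s AND value_ts < %s",
        (new_value, ts, user_id, ts),
    )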

Twitter's Gizzard – Trying it Out

I'm convinced! How do I begin?
➢ Learn Scala.
➢ Clone “rowz” from Github:
  ➢ https://github.com/twitter/Rowz
➢ Modify it to suit your needs.
➢ Learn how it interacts with existing tools.
➢ Write new monitoring/alerting plugins.
➢ Write unit tests!
➢ You should OSS it to help with overhead.

Twitter's Gizzard – Trying it Out

Sounds daunting. Maybe I'll roll my own?

Learn from others' mistakes:
➢ Digg: 2 engineers, 6 months. Code thrown away. Digg out of business.
➢ Countless identical stories in Silicon Valley.

An NIHS (Not-Invented-Here Syndrome) attitude == go out of business*.

* 8-figure R&D budgets excepted.

Youtube's Vitess/Vtocc – What is it?

Vitess is a library. Vtocc is an implementation using it.

Vtocc is another middleware solution:
➢ Sharding,
➢ Caching,
➢ Connection-pooling,
➢ In-use at Youtube,
➢ Built-in fail-safe features.

Youtube's Vtocc – Why use it?

Proven high-volume sharding solution.

Interesting feature-list:
➢ Auto query/transaction over-limit killing.
➢ Better query-cache implementation.
➢ Query comment-stripping for query cache.
➢ Query consolidation.
➢ Zero downtime restarts.

Less coding than Gizzard (more plug-in).

Youtube's Vtocc – Hold on, Zero Downtime Restarts?

Just start a new Vtocc instance:
➢ Instance1 passes new requests to Instance2,
➢ Instance1's connections get 30s to complete,
➢ Instance2 kills Instance1 and takes over.

[Diagram: Vtocc Instance 1 handing off to Vtocc Instance 2.]

Youtube's Vtocc – The Fine Print

Requires Particular Primary Keys:
➢ varbinary datatype,
➢ Choose carefully to prevent hot-spots.

Max result-set size: larger resultsets fail.

Additional administration burden:
➢ “My query was killed. Why?”
➢ Middleware adds spooky, hard-to-diagnose failure modes.

Youtube's Vtocc – Implementation Details

➢ Run Vtocc on the same server as MySQL.
➢ Configure Vtocc fail-safes for expected load:
  ➢ Pool Size (connection count),
  ➢ Max Transactions (has its own connection pool),
  ➢ Query Timeout (before killed),
  ➢ Transaction Timeout (before killed),
  ➢ Max Resultset Size in rows:
    ➢ Go doesn't free allocated memory, so pick this value carefully.
➢ More details: http://code.google.com/p/vitess/wiki/Operations

HAproxy – Re-thinking Proxy Topology

Old-school Proxy Topology:
➢ DB Clients on one side,
➢ DB Servers on the other,
➢ Proxy in-between.

The proxy is a Single Point of Failure.

HAproxy – Re-thinking Proxy Topology

A free proxy provides a new architecture option:
➢ Proxy on every DB client node.
➢ Good-bye single-point-of-failure.
➢ Hello configuration management for the proxy (see the config sketch below).

[Diagram: one HAproxy instance running on every DB client node.]
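
As an illustration of the client-side layout – a minimal, hand-written haproxy.cfg fragment with made-up names, reusing the example shard-master IP from earlier; not a configuration shipped with any of the tools above – each client would run something like:

# Local HAproxy listener on a DB client node; the application connects to
# 127.0.0.1:3306 and HAproxy forwards to the shard master.
listen mysql_shard_a
    bind 127.0.0.1:3306
    mode tcp
    balance roundrobin
    server shard_a_master 10.244.17.6:3306 check

A real deployment would add global/defaults sections (timeouts, maxconn) and push this file out with the same configuration management already managing the cluster.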

Methods of Sharding MySQL – Q&A

Questions? Suggestions:
➢ Interesting stuff. Got a job for me?
➢ Well, I got a job for you. Interested?
➢ Warn me next time so I can sleep in the back row.
➢ Was that a question?

Thank you! Emails to domain palominodb, username time. Percona Live 2012 in New York City. Enjoy the rest of the show!