Scaling to 200K Transactions per Second with Open Source - MySQL, Java, curl, PHP

Scaling is fun

by Dathan Vance Pattishall

Contents

Who am I

Introductions

Requirements

Design to solve requirements

Federation

Java (Friend Queries)

INNODB-isms

More Stats

Questions

Who am I?

Dathan Vance Pattishall

Chief Data-Layer Architect

Shares on http://mysqldba.blogspot.com

Scaling a Widget Company

Federation at Flickr: Doing Billions of Queries per Day

Scaling a HUGE volume of concurrent writes

Worked at

Introduction

Now I work at RockYou

When I started

Facebook Shards do 100K TPS alone

MySpace, Hi5, Orkut, Ads, the main site, and various other DB servers sum to 100K TPS

On Less than 120 Database Servers

32–48 GB of RAM

8-disk RAID 10 with a 256 MB PERC 6 controller

We can support any Logical SQL Query

T E A M

The Requirements

• Scale linearly
• Store some data forever
• Allow for change
• Keep it cheap
• Oh, and downtime is not an option

Design to Meet the Requirements

Need redundancy

Need a lot of I/O bandwidth

Need to remove replication lag

Need a system to do processing offline

Need to do it all without downtime

Do it cheaply

Federation

User 1’s Data, User 2’s Data, User 3’s Data, … User N’s Data all on one server: this does NOT increase write throughput.

User 1’s Data, User 2’s Data, User 3’s Data, … User N’s Data split one per shard: this increases write throughput.

How does one Federate?

Who / what owns the data

How can you answer any question asked?

First, need to handle master–master replication:

• No auto-increments
• GUIDs only
• Bucket assignment
• Data access follows a pattern

Enter Global Lookup Cluster

• Hash lookups are fast; can do 45K qps on a single server
• Ownerid -> Shard_id
• Groupid -> Shard_id
• Tagid -> Shard_id
• Url_id -> Shard_id
• Front by memcache
• Use consistent hashing to add capacity horizontally and for HA
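As a rough illustration of a lookup call (not the actual RockYou code), here is a minimal Java sketch assuming the spymemcached client and a hypothetical owner_to_shard table on the lookup cluster: check memcache first, fall back to the lookup database, then cache the mapping.

    import net.spy.memcached.MemcachedClient;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class ShardLookup {
        private final MemcachedClient cache;
        private final Connection lookupDb;   // connection to the global lookup cluster

        public ShardLookup(MemcachedClient cache, Connection lookupDb) {
            this.cache = cache;
            this.lookupDb = lookupDb;
        }

        public int shardForOwner(long ownerId) throws SQLException {
            String key = "owner_shard:" + ownerId;
            Object cached = cache.get(key);                    // front by memcache
            if (cached != null) return Integer.parseInt(cached.toString());

            try (PreparedStatement ps = lookupDb.prepareStatement(
                    "SELECT shard_id FROM owner_to_shard WHERE owner_id = ?")) {
                ps.setLong(1, ownerId);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) throw new IllegalStateException("unmapped owner " + ownerId);
                    int shardId = rs.getInt(1);
                    cache.set(key, 3600, Integer.toString(shardId));  // cache the mapping for an hour
                    return shardId;
                }
            }
        }
    }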

Write Multiple Views of the Data

The inviter knows who they invited; the invited knows who invited them

Keep Data Consistent

Write Data to Shard 1

Write Data to Shard 2

If Shard 2 says ok Commit Data on Shard 1

If Shard 1 says ok Commit Data on Shard 2

If any step fails ROLLBACK

Use a Java app to do this in parallel and remove race conditions
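A minimal Java sketch of those steps, assuming two JDBC connections and hypothetical invites_sent / invites_received tables: stage the write on both shards inside transactions, commit both only when both succeed, and roll both back on any failure.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class CrossShardWriter {
        public void writeInvite(Connection shard1, Connection shard2,
                                long inviterId, long invitedId) throws SQLException {
            shard1.setAutoCommit(false);
            shard2.setAutoCommit(false);
            try {
                // View of the data on the inviter's shard
                try (PreparedStatement ps = shard1.prepareStatement(
                        "INSERT INTO invites_sent (inviter_id, invited_id) VALUES (?, ?)")) {
                    ps.setLong(1, inviterId);
                    ps.setLong(2, invitedId);
                    ps.executeUpdate();
                }
                // View of the data on the invited user's shard
                try (PreparedStatement ps = shard2.prepareStatement(
                        "INSERT INTO invites_received (invited_id, inviter_id) VALUES (?, ?)")) {
                    ps.setLong(1, invitedId);
                    ps.setLong(2, inviterId);
                    ps.executeUpdate();
                }
                // Both shards said ok, so commit both
                shard1.commit();
                shard2.commit();
            } catch (SQLException e) {
                // Any failed step rolls back both sides
                shard1.rollback();
                shard2.rollback();
                throw e;
            }
        }
    }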

What if I need an ID to represent a row?

REPLACE INTO TicketsGeneric (stub) VALUES ('a');  -- get an ID back via SELECT LAST_INSERT_ID()

CREATE TABLE `TicketsGeneric` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `stub` char(1) NOT NULL default '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM AUTO_INCREMENT=7445309740
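A hedged sketch of fetching an ID from that ticket table over JDBC; the connection details are hypothetical, and the REPLACE and SELECT LAST_INSERT_ID() must run on the same connection.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TicketClient {
        // Returns a new globally unique ID from the ticket server
        public static long nextId(Connection ticketDb) throws Exception {
            try (Statement st = ticketDb.createStatement()) {
                st.executeUpdate("REPLACE INTO TicketsGeneric (stub) VALUES ('a')");
                try (ResultSet rs = st.executeQuery("SELECT LAST_INSERT_ID()")) {
                    rs.next();
                    return rs.getLong(1);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical host, schema, and credentials
            try (Connection c = DriverManager.getConnection(
                    "jdbc:mysql://tickets01/app", "app", "secret")) {
                System.out.println("new id: " + nextId(c));
            }
        }
    }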

But what if I need a global view of the table

• Cron jobs
• Front by memcache
• Offline Tasks to atomically write the job and return the page quickly, i.e. defer writes to many recipients
  – Pure PHP
  – Like GEARMAND; uses IPC distributed across servers
  – Does 100 million actions per day and scales linearly
• @see the Friend Query section

What about maintenance?

Have redundancy: take a side down, or rotate a server into the master–master config

Alters

Optimize tables

Add new tables

Massive Deletes

Data Repair

What about Shard Imbalance?

Migrate them

• object_id -> shard_id, lock shard_id for object_id

• Migrate the user
• If error, die and send an alert
• Takes less than 30 seconds per primary object
• Currently shards are self-balancing; can migrate 4 million users in 8 days at the slowest setting

What about managing data size?

• Enter Shard Types
  – Archive Shard
  – Sub Shards

• One way a DBA can scale is to partition and allocate a server per table. Why not partition by shard type?

• Allows for bleeding-edge tech: we have 10 shards running XtraDB

What about Split Brain?

I allow writes on both servers in Master-Master Configs.

Stick Primary Object ID to a server

If you read my data, you access it the same way I access it; the same goes for writes.

If a server fails flip to redundant server

$PRIMARY_OBJECT_ID % $NUM_SERVS == BUCKET

Also gets rid of Slave lag for the most part.
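A minimal Java sketch of this bucketing rule, with hypothetical host names: the primary object ID modulo the number of master–master pairs picks the bucket, and a connection failure flips reads and writes to the redundant server in that pair.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class BucketRouter {
        // One master-master pair per bucket: [primary side, redundant side]
        private static final String[][] SERVERS = {
            {"jdbc:mysql://shard01a/app", "jdbc:mysql://shard01b/app"},
            {"jdbc:mysql://shard02a/app", "jdbc:mysql://shard02b/app"},
        };

        public static Connection connect(long primaryObjectId) throws SQLException {
            // $PRIMARY_OBJECT_ID % $NUM_SERVS == BUCKET
            int bucket = (int) (primaryObjectId % SERVERS.length);
            try {
                return DriverManager.getConnection(SERVERS[bucket][0], "app", "secret");
            } catch (SQLException e) {
                // Server failed: flip to the redundant server of the same pair
                return DriverManager.getConnection(SERVERS[bucket][1], "app", "secret");
            }
        }
    }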

Friend Queries: MULTI-GET from Shards

Jetty + J/Connect (Async Shard Server)

• Can query 8 shards at a time in parallel
• Data is merged on the fly
• JSON is the communication protocol
• private ExecutorService exec = Executors.newFixedThreadPool(8); // 4 CPU * 0.8 utilization * (1 + W/C) =~ 8
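A hedged sketch of what such a parallel MULTI-GET can look like, assuming per-shard JDBC connections and a hypothetical users table: fan the per-shard ID lists out over the fixed thread pool and merge the rows as each future completes.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class FriendQuery {
        private final ExecutorService exec = Executors.newFixedThreadPool(8); // 4 CPU * 0.8 * (1 + W/C) =~ 8

        public List<String> friendNames(Map<Integer, Connection> shardConns,
                                        Map<Integer, List<Long>> friendIdsByShard) throws Exception {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (Map.Entry<Integer, List<Long>> e : friendIdsByShard.entrySet()) {
                Connection conn = shardConns.get(e.getKey());
                List<Long> ids = e.getValue();
                // Each shard is queried on its own pool thread
                futures.add(exec.submit((Callable<List<String>>) () -> {
                    List<String> names = new ArrayList<>();
                    try (PreparedStatement ps = conn.prepareStatement(
                            "SELECT name FROM users WHERE user_id = ?")) {
                        for (long id : ids) {
                            ps.setLong(1, id);
                            try (ResultSet rs = ps.executeQuery()) {
                                while (rs.next()) names.add(rs.getString(1));
                            }
                        }
                    }
                    return names;
                }));
            }
            // Merge the per-shard results as the futures complete
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures) merged.addAll(f.get());
            return merged;
        }
    }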

J/Connect

/* mysql-connector-java-5.1.7 ( Revision: ${svn.Revision} ) */
SHOW VARIABLES WHERE Variable_name = 'language' OR Variable_name = 'net_write_timeout' OR
  Variable_name = 'interactive_timeout' OR Variable_name = 'wait_timeout' OR
  Variable_name = 'character_set_client' OR Variable_name = 'character_set_connection' OR
  Variable_name = 'character_set' OR Variable_name = 'character_set_server' OR
  Variable_name = 'tx_isolation' OR Variable_name = 'transaction_isolation' OR
  Variable_name = 'character_set_results' OR Variable_name = 'timezone' OR
  Variable_name = 'time_zone' OR Variable_name = 'system_time_zone' OR
  Variable_name = 'lower_case_table_names' OR Variable_name = 'max_allowed_packet' OR
  Variable_name = 'net_buffer_length' OR Variable_name = 'sql_mode' OR
  Variable_name = 'query_cache_type' OR Variable_name = 'query_cache_size' OR
  Variable_name = 'init_connect';

• Takes 180 ms+

Fix

Add &cacheServerConfiguration=true to your JDBC URL directive.
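For example, the URL would look like this (host and schema names are hypothetical):

jdbc:mysql://shard01:3306/app?cacheServerConfiguration=true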

@see http://assets.en.oreilly.com/1/event/21/Connector_J%20Performance%20Gems%20Presentation.pdf

Writing Large Strings REALTIME

• Incrementing impressions is easy, but storing referrer URLs is not as easy in real time

• Why you must know the limits of the storage engine you are using

INNODB & Strings

• Indexing a string takes a lot of space
• Indexing a large string takes even more space
• Each index has its own 16KB page
• Fragmentation across pages was hurting the app, chewing up I/O
• Lots of disk space chewed up per day
• Due to a bunch of overhead with strings & deadlock detection

INNODB & High Concurrency of Writes

• Requirement: 300 ms for total db access FOR ALL Apps

• Writes slow down at high concurrency when the datafile size is greater than the buffer pool size

• 10 ms to 20 seconds sometimes for the full transaction

• Fixed by offloading the query to an Offline Task that writes it from a single thread

Deadlock / Transaction Overhead Solved

• Put up a Java daemon that buffers up to 4000 messages (transactions) and applies them serially with one thread
• It does not go down, and if it does we can fail over
• Logs data to local disk for outstanding transactions
• It does not use much memory or CPU
• Even during peak, messages do not exceed 200 outstanding transactions
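A minimal sketch of such a daemon, assuming a bounded queue of 4000 pending messages and a hypothetical referrer_log table: callers enqueue instead of writing directly, and a single writer thread drains the queue and applies each write serially.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class SerialWriteDaemon implements Runnable {
        private final BlockingQueue<String> pending = new ArrayBlockingQueue<>(4000);
        private final Connection shard;

        public SerialWriteDaemon(Connection shard) { this.shard = shard; }

        // Callers enqueue; blocks if 4000 messages are already outstanding
        public void submit(String referrerUrl) throws InterruptedException {
            pending.put(referrerUrl);
        }

        @Override
        public void run() {
            // One thread applies the writes serially, so InnoDB never sees concurrent writers
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    String url = pending.take();
                    try (PreparedStatement ps = shard.prepareStatement(
                            "INSERT INTO referrer_log (url) VALUES (?)")) {
                        ps.setString(1, url);
                        ps.executeUpdate();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (Exception e) {
                    // In production, log to local disk so outstanding transactions survive a restart
                    e.printStackTrace();
                }
            }
        }
    }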

Disk Consumption solved

• Archive data
• Compress using INNODB 1.0.4
• innodb_file_format = Barracuda
• 8K key block size: best bang for the buck for our data. A smaller key block size causes a major slowdown in transactions.
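A hedged sketch of turning that on from JDBC (a DBA would normally just run the SQL directly); the table name is hypothetical, and innodb_file_per_table must also be enabled for Barracuda row formats.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CompressArchive {
        public static void main(String[] args) throws Exception {
            // Hypothetical archive shard, schema, and credentials
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://archive-shard01/app", "dba", "secret");
                 Statement st = conn.createStatement()) {
                st.execute("SET GLOBAL innodb_file_format = 'Barracuda'");
                // 8K key block size was the best bang for the buck for this data
                st.execute("ALTER TABLE referrer_archive ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8");
            }
        }
    }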

Stats Across All Services

Over 17 billion transactions per day; can sustain 200K+++ TPS

Across 25 TB of data

300K Memcache Gets a Second

10 million active users per Shard

A large % of ALL major social network users have a RockYou presence (federated by SN/user)

99.999% uptime

All connections are made on the fly

All balancing handled by application

Use memcache to reduce latency; we can run without it!

Questions / Want to Work here?

dathan@rockyou.com
