1,000,000 daily users and no cache


Online games pose a few interesting challenges for their backend: a single user generates an HTTP call every few seconds, and the balance between data reads and writes is close to 50/50, which makes a write-through cache and other common scaling approaches less effective. Starting from a fairly classic Rails application, we gradually changed it as traffic grew in order to meet the required performance. When small changes were no longer enough, we turned parts of our data persistence layer inside out, migrating from SQL to NoSQL without taking downtimes longer than a few minutes. Follow the problems we hit, how we diagnosed them, and how we got around limitations. See which tools we found useful and which other lessons we learned running the system with a team of just two developers, without a sysadmin or operations team as support.


O’Reilly RailsConf, 2011-05-18

Who is that guy?

Jesper Richter-Reichhelm / @jrirei

Berlin, Germany

Head of Engineering @ wooga

Wooga does social games

Wooga has dedicated game teams

PHP / Cloud
Ruby / Cloud
Ruby / Bare metal
Erlang / Cloud (coming soon)

Flash client sends state changes to backend

[Diagram: Flash client → Ruby backend]


Social games need to scale quite a bit

1 million daily users

200 million daily HTTP requests

100,000 DB operations per second

40,000 DB updates per second

… but no cache

A journey to 1,000,000 daily users

Start of the journey

6 weeks of pain

Paradise

Looking back

We used two MySQL shards right from the start

data_fabric gem

Easy to use

Sharding by Facebook’s user id

Sharding on controller level
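As an illustration of that setup, a minimal data_fabric sketch could look roughly like this; the shard names, the modulo-2 split, and the current_facebook_id helper are assumptions for illustration, not taken from the slides:

```ruby
# Model: read from slaves, write to the master of whichever shard is active.
class Tile < ActiveRecord::Base
  data_fabric :replicated => true, :shard_by => :user
end

# Controller level: pick the shard once per request from the Facebook user id.
class ApplicationController < ActionController::Base
  around_filter :activate_user_shard

  private

  def activate_user_shard
    shard = "shard_#{current_facebook_id.to_i % 2}"   # hypothetical helper, 2 shards
    DataFabric.activate_shard(:user => shard) { yield }
  end
end
```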

Scaling application servers was very easy

Running on Amazon EC2

Scalarium for automation and deployment
Role based
Chef recipes for customization

Scaling app servers up and out was simple

Master-slave replication for DBs worked fine

[Diagram: load balancer → 3 app servers → 2 MySQL master/slave shards]

We added a few application servers over time

[Diagram: load balancer → 9 app servers → 2 MySQL master/slave shards]

Basic setup worked well for 3 months

[Timeline: May to December]

A journey to 1,000,000 daily users

Start of the journey

6 weeks of pain

Paradise

Looking back


SQL queries generated by Rubyamf gem

Rubyamf serializes returned objects

Wrong config => it also loads associations

Really bad but very easy to fix
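For illustration, the same association-loading pitfall shown with ActiveRecord's plain JSON serializer rather than AMF; the :plants association is made up:

```ruby
tile = Tile.find_by_id(42)

# Serializes only the tile's own attributes: no extra queries.
tile.to_json

# Including associations makes the serializer load them as well,
# i.e. one extra SELECT per returned object - which is what the
# wrong Rubyamf config was effectively doing on every API call.
tile.to_json(:include => :plants)
```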

New Relic shows what happens during an API call

More traffic using the same cluster

[Diagram: load balancer → 9 app servers → 2 MySQL master/slave shards]

Our DB problems began


MySQL hiccups were becoming more frequent

Everything was running fine ...

... and then DB throughput went down to 10%

After a few minutes everything stabilized again ...

... just to repeat the cycle 20 minutes later!


ActiveRecord’s status checks caused 20% extra DB load

Check connection state before each SQL query

MySQL process list full of ‘status’ calls
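One way this kind of overhead is sometimes removed is to stub out the adapter's connection check; a hedged sketch for a Rails 2.x-era MySQL adapter, not necessarily what was done here, and it trades safety for fewer round trips:

```ruby
# config/initializers/skip_connection_verification.rb
#
# ActiveRecord verifies MySQL connections by calling #active?, which issues
# the "status" command that was filling the process list. Assume the
# connection is alive instead and let a failing query surface dead
# connections.
if defined?(ActiveRecord::ConnectionAdapters::MysqlAdapter)
  ActiveRecord::ConnectionAdapters::MysqlAdapter.class_eval do
    def active?
      true
    end
  end
end
```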

I/O on MySQL masters still was the bottleneck

New Relic shows DB table usage

60% of all UPDATEs were done on the ‘tiles’ table

New Relic shows most used tables

Tiles are part of the core game loop

Core game loop:
1) plant
2) wait
3) harvest

All are operations on the tiles table.


We started to shard by model, too

We put this table in 2 extra shards:
1) Set up new masters as slaves of the old ones
2) App servers start using the new masters, too
3) Cut replication
4) Truncate the unused tables

[Diagram: old master/slave pair replicating to new master/slave pair until the cut-over]

8 DBs and a few more servers

[Diagram: load balancer → more app servers → 2 tiles master/slave shards + 2 main master/slave shards]

Doubling the number of DBs didn’t fix it


We improved our MySQL setup

RAID-0 of EBS volumes

Using XtraDB instead of vanilla MySQL

Tweaking my.cnf:
innodb_flush_log_at_trx_commit
innodb_flush_method


Data-fabric gem circumvented AR’s internal cache

2x Tile.find_by_id(id) => 1x SELECT …
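This is the standard ActiveRecord query-cache behaviour the slide refers to: inside a cached block (Rails wraps every request in one), two identical finds should issue a single SELECT, which, per the slide, the data_fabric connection proxy was circumventing:

```ruby
ActiveRecord::Base.cache do
  Tile.find_by_id(42)   # SELECT * FROM tiles WHERE id = 42
  Tile.find_by_id(42)   # answered from the query cache, no second SELECT
end
```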

2 + 2 masters and still I/O was not fast enough

We were getting desperate:

“If 2 + 2 is not enough, ...

… perhaps 4 + 4 masters will do?”


It’s no fun to handle 16 MySQL DBs

[Diagram: load balancer → app servers → 4 tiles master/slave shards + 4 main master/slave shards (16 MySQL servers)]

We were at a dead end with MySQL


I/O remained the bottleneck for MySQL UPDATEs

Peak throughput overall: 5,000 writes/s

Peak throughput single master: 850 writes/s

We could get to ~1,000 writes/s …

… but then what?

Pick the right tool for the job!

Redis is fast but goes beyond simple key/value

Redis is a key-value store
Hashes, Sets, Sorted Sets, Lists
Atomic operations like set, get, increment

50,000 transactions/s on EC2
Writes are as fast as reads

Shelf tiles: an ideal candidate for using Redis

Shelf tiles:
{ plant1 => 184,
  plant2 => 141,
  plant3 => 130,
  plant4 => 112,
  … }


Redis Hash
HGETALL: load whole hash
HGET: load a single key
HSET: set a single key
HINCRBY: increase value of a single key
…
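A minimal redis-rb sketch of shelf tiles as a Redis hash; the key layout and field names are illustrative, not taken from the slides:

```ruby
require 'redis'

redis   = Redis.new
user_id = 42
key     = "shelf_tiles:#{user_id}"    # hypothetical key layout

redis.hincrby(key, 'plant1', 1)       # harvest: one atomic counter update
redis.hget(key, 'plant1')             # => "185"
redis.hset(key, 'plant3', 130)        # set a single field
redis.hgetall(key)                    # => { "plant1" => "185", "plant2" => "141", ... }
```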

Migrate on the fly when accessing the new model

Migrate on the fly - but only once

Returns true if the id could be added, else false
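A sketch of what "migrate on the fly, but only once" could look like, using a Redis set of already-migrated ids; SADD's return value is exactly the "true if the id could be added, else false" mentioned above. Model, key, and field names are illustrative:

```ruby
require 'redis'

REDIS = Redis.new

# Copy one user's shelf tiles from MySQL to Redis the first time they are
# touched. With redis-rb of that era SADD returns true only for the first
# caller, so the copy runs exactly once per user.
def migrate_shelf_tiles!(user_id)
  return unless REDIS.sadd('migrated_shelf_tiles', user_id)

  ShelfTile.find_all_by_user_id(user_id).each do |tile|   # old SQL model (illustrative)
    REDIS.hset("shelf_tiles:#{user_id}", tile.plant, tile.count)
  end
  # Error handling omitted: if the copy fails, the id should be removed
  # from the set again so the migration can be retried.
end

def shelf_tiles(user_id)
  migrate_shelf_tiles!(user_id)
  REDIS.hgetall("shelf_tiles:#{user_id}")
end
```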

Migrate on the fly - and clean up later

1. Let this run until everything cools down
2. Then migrate the rest manually
3. Remove the migration code
4. Wait until no fallback is necessary
5. Remove the SQL table

Migrations on the fly all look the same

[Chart: typical migration throughput over 3 days; initial peak at 100 migrations/second]

A journey to 1,000,000 daily users

Start of the journey

6 weeks of pain

Paradise (or not?)

Looking back

Again: Tiles are part of the core game loop

Core game loop:
1) plant
2) wait
3) harvest

All are operations on the tiles table.

Size matters for migrations

Migration check overloaded Redis
Migration only on startup

We overlooked an edge case
Next time: only migrate 1% of users
Migrate the rest when everything worked out


In-memory DBs don’t like to dump to disk

You still need to write data to disk eventually
SAVE is blocking
BGSAVE needs free RAM

Dumping on the master increased latency by 100%

Running BGSAVE on slaves every 15 minutes
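A sketch of what dumping on a slave could look like; the host name is an assumption, the 15-minute schedule comes from the slide (run it from cron or similar):

```ruby
#!/usr/bin/env ruby
# Runs every 15 minutes against a Redis slave, so the fork and
# copy-on-write cost of BGSAVE never hits the master.
require 'redis'

slave = Redis.new(:host => 'redis-slave-1')   # hypothetical host
slave.bgsave                                  # asynchronous dump on the slave
```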

Replication puts the Redis master under load

Replication on the master starts with a BGSAVE
Not enough free RAM => big problem

The master queues requests that the slave cannot process yet
Restoring the dump on slaves is blocking

Redis has a memory fragmentation problem

[Chart: 24 GB grew to 44 GB in 8 days]

[Chart: 24 GB grew to 38 GB in 3 days]


If MySQL is a truck, Redis is a Ferrari


Big and static data in MySQL, the rest goes to Redis

MySQL (the truck): fast enough for reads, can store on disk, robust replication; 256 GB of data, 10% writes

Redis (the Ferrari): super fast reads/writes, out of memory => dead, fragile replication; 60 GB of data, 50% writes

Photo: http://www.flickr.com/photos/erix/245657047/

95% of operations effort is handling DBs!

[Diagram: 2 load balancers → ~39 app servers → 5 Redis master/slave pairs + 5 MySQL master/slave pairs]

We fixed all our problems - we were in Paradise!


A journey to 1,000,000 daily users

Start of the journey

6 weeks of pain

Paradise (or not?)

Looking back

Data handling is most important

How much data will you have?

How often will you read/update that data?

What data is in the client, what in the server?

It’s hard to “refactor” data later on

Monitor and measure your application closely

On application level / on OS level

Learn your app’s normal behavior

Store historical data so you can compare later

React if necessary!

Listen closely and answer truthfully

slideshare.net/wooga

Jesper Richter-Reichhelm

@jrirei

Q & A

If you are in Berlin just drop by

Thank you
