© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Nate Wiger, Principal Solutions Architect, AWS Tom Kerr, Software Engineer, Riot Games October 8, 2015 Amazon ElastiCache Deep Dive Scaling Your Data in a Real-Time World DAT407

(DAT407) Amazon ElastiCache: Deep Dive

Page 1: (DAT407) Amazon ElastiCache: Deep Dive

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Nate Wiger, Principal Solutions Architect, AWS

Tom Kerr, Software Engineer, Riot Games

October 8, 2015

Amazon ElastiCache Deep Dive: Scaling Your Data in a Real-Time World

DAT407

Page 2: (DAT407) Amazon ElastiCache: Deep Dive

Amazon ElastiCache

• Managed in-memory service

• Memcached or Redis

• Cluster of nodes

• Read replicas

• Monitoring + alerts

Page 3: (DAT407) Amazon ElastiCache: Deep Dive

ELB App

External APIs

Modern Web / Mobile App

Page 4: (DAT407) Amazon ElastiCache: Deep Dive

Memcached vs Redis

• Flat string cache

• Multithreaded

• No persistence

• Low maintenance

• Easy to scale horizontally

• Single-threaded

• Persistence

• Atomic operations

• Advanced data types -

http://redis.io/topics/data-types

• Pub/sub messaging

• Read replicas / failover

Page 5: (DAT407) Amazon ElastiCache: Deep Dive

Storing JSON – Memcached vs Redis

# Memcached: Serialize string

str_json = Encode({"name": "Nate Wiger", "gender": "M"})

SET user:nateware str_json

str_json = GET user:nateware

json = Decode(str_json)

# Redis: Use a hash!

HMSET user:nateware name "Nate Wiger" gender M

HGET user:nateware name

>> Nate Wiger

HMGET user:nateware name gender

>> Nate Wiger

>> M

Page 6: (DAT407) Amazon ElastiCache: Deep Dive

ElastiCache with Memcached

Page 7: (DAT407) Amazon ElastiCache: Deep Dive

ElastiCache with Memcached – Development

Region

Availability Zone A Availability Zone B

Auto Scaling group

ElastiCache cluster

Page 8: (DAT407) Amazon ElastiCache: Deep Dive

ElastiCache with Memcached – Development

Region

Availability Zone A Availability Zone B

Auto Scaling group

ElastiCache cluster

Nope

Page 9: (DAT407) Amazon ElastiCache: Deep Dive

Add Nodes to Memcached Cluster

Page 10: (DAT407) Amazon ElastiCache: Deep Dive

Add Nodes to Memcached Cluster

Page 11: (DAT407) Amazon ElastiCache: Deep Dive

Add Nodes to Memcached Cluster

aws elasticache modify-cache-cluster \
    --cache-cluster-id my-cache-cluster \
    --num-cache-nodes 4 \
    --apply-immediately

# response

"CacheClusterStatus": "modifying",

"PendingModifiedValues": {

"NumCacheNodes": 4

},

Page 12: (DAT407) Amazon ElastiCache: Deep Dive

ElastiCache with Memcached – High Availability

Region

Availability Zone A Availability Zone B

Auto Scaling group

ElastiCache cluster

Page 13: (DAT407) Amazon ElastiCache: Deep Dive

ElastiCache with Memcached – Scale Out

Region

Availability Zone A Availability Zone B

Auto Scaling group

ElastiCache cluster

Page 14: (DAT407) Amazon ElastiCache: Deep Dive

Sharding

Page 15: (DAT407) Amazon ElastiCache: Deep Dive

Consistent Hashing

Client pre-calculates a hash ring for best key distribution

http://berb.github.io/diploma-thesis/original/062_internals.html
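To make the hash ring idea concrete, here is a minimal sketch in Python. It is not the algorithm of any particular Memcached client; the node names, the use of MD5, and the count of 100 virtual nodes per server are assumptions chosen for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring; node names and replica count are illustrative."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._ring = {}        # ring position -> node name
        self._positions = []   # sorted ring positions
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread keys evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Each node is placed on the ring many times ("virtual nodes").
        for i in range(self.replicas):
            position = self._hash(f"{node}#{i}")
            self._ring[position] = node
            bisect.insort(self._positions, position)

    def get_node(self, key):
        # Walk clockwise to the first position at or after the key's hash.
        index = bisect.bisect_left(self._positions, self._hash(key))
        if index == len(self._positions):
            index = 0  # wrap around to the start of the ring
        return self._ring[self._positions[index]]


ring = HashRing(["cache-0001", "cache-0002", "cache-0003"])
keys = [f"user:{i}" for i in range(1000)]
before = {key: ring.get_node(key) for key in keys}

ring.add_node("cache-0004")  # scale out by one node
after = {key: ring.get_node(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

Only the keys that now belong to the new node change owner; every other key stays on its original node, which is why adding a node invalidates a fraction of the cache rather than all of it.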

Page 16: (DAT407) Amazon ElastiCache: Deep Dive

It’s All Been Done Before

• Ruby
  • Dalli https://github.com/mperham/dalli
  • Plus ElastiCache https://github.com/ktheory/dalli-elasticache

• Python
  • HashRing / MemcacheRing https://pypi.python.org/pypi/hash_ring/
  • Django w/ Auto-Discovery https://github.com/gusdan/django-elasticache

• Node.js
  • node-memcached https://github.com/3rd-Eden/node-memcached
  • Auto-Discovery example http://stackoverflow.com/questions/17046661

• Java
  • SpyMemcached https://github.com/dustin/java-memcached-client
  • ElastiCache Client https://github.com/amazonwebservices/aws-elasticache-cluster-client-memcached-for-java

• PHP
  • ElastiCache Client https://github.com/awslabs/aws-elasticache-cluster-client-memcached-for-php

• .NET
  • ElastiCache Client https://github.com/awslabs/elasticache-cluster-config-net

Page 17: (DAT407) Amazon ElastiCache: Deep Dive

Auto-Discovery Endpoint

Page 18: (DAT407) Amazon ElastiCache: Deep Dive

# PHP

$server_endpoint = "mycache.z2vq55.cfg.usw2.cache.amazonaws.com";

$cache = new Memcached();

$cache->setOption(

Memcached::OPT_CLIENT_MODE, Memcached::DYNAMIC_CLIENT_MODE);

# Set config endpoint as only server

$cache->addServer($server_endpoint, 11211);

DIY: http://bit.ly/elasticache-autodisc

Memcached Node Auto-Discovery

Page 19: (DAT407) Amazon ElastiCache: Deep Dive

App Caching Patterns

Page 20: (DAT407) Amazon ElastiCache: Deep Dive

Be Lazy

# Python
def get_user(user_id):
    record = cache.get(user_id)
    if record is None:
        # Run a DB query
        record = db.query("select * from users where id = ?", user_id)
        cache.set(user_id, record)
    return record

# App code
user = get_user(17)

Page 21: (DAT407) Amazon ElastiCache: Deep Dive

Write On Through

# Python
def save_user(user_id, values):
    record = db.query("update users ... where id = ?", user_id, values)
    cache.set(user_id, record)
    return record

# App code
user = save_user(17, {"name": "Nate Dogg"})

Page 22: (DAT407) Amazon ElastiCache: Deep Dive

Combo Move!

def save_user(user_id, values):
    record = db.query("update users ... where id = ?", user_id, values)
    cache.set(user_id, record, 300)  # TTL
    return record

def get_user(user_id):
    record = cache.get(user_id)
    if record is None:
        record = db.query("select * from users where id = ?", user_id)
        cache.set(user_id, record, 300)  # TTL
    return record

# App code
save_user(17, {"name": "Nate Diddy"})
user = get_user(17)

Page 23: (DAT407) Amazon ElastiCache: Deep Dive

Web Cache with Memcached

# Gemfile
gem 'dalli-elasticache'

# config/environments/production.rb
endpoint = "mycluster.abc123.cfg.use1.cache.amazonaws.com:11211"
elasticache = Dalli::ElastiCache.new(endpoint)
config.cache_store = :dalli_store, elasticache.servers,
  expires_in: 1.day, compress: true

# if you change ElastiCache cluster nodes
elasticache.refresh.client

Ruby on Rails Example

Page 24: (DAT407) Amazon ElastiCache: Deep Dive

Thundering Herd

Causes

• Cold cache – app startup

• Adding / removing nodes

• Cache key expiration (TTL)

• Out of cache memory

Large # of cache misses

Spike in database load

Mitigations

• Script to populate cache

• Gradually scale nodes

• Randomize TTL values

• Monitor cache utilization
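The "Randomize TTL values" mitigation can be sketched as below. This is an illustration only: the 300-second base matches the TTL used in the earlier slides, while the 20% spread and the function name `jittered_ttl` are assumptions.

```python
import random

BASE_TTL_SECONDS = 300  # same base TTL as the earlier cache.set(..., 300) examples


def jittered_ttl(base=BASE_TTL_SECONDS, spread=0.2, rng=random):
    """Return the base TTL shifted by a random amount of up to +/- spread * base."""
    offset = rng.uniform(-spread, spread) * base
    return max(1, int(round(base + offset)))


# A seeded generator keeps this demonstration repeatable.
rng = random.Random(42)
ttls = [jittered_ttl(rng=rng) for _ in range(1000)]
```

Passing `jittered_ttl()` instead of a fixed 300 to `cache.set` spreads expirations over a window, so keys written at the same moment do not all miss at the same moment.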

Page 25: (DAT407) Amazon ElastiCache: Deep Dive

ElastiCache with Redis

Page 26: (DAT407) Amazon ElastiCache: Deep Dive

Not if I destroy it first!

It’s mine!

Need uniqueness + ordering

Easy with Redis Sorted Sets

ZADD "leaderboard" 1201 "Gollum"
ZADD "leaderboard" 963 "Sauron"
ZADD "leaderboard" 1092 "Bilbo"
ZADD "leaderboard" 1383 "Frodo"

ZREVRANGE "leaderboard" 0 -1
1) "Frodo"
2) "Gollum"
3) "Bilbo"
4) "Sauron"

ZREVRANK "leaderboard" "Sauron"
(integer) 3

Real-time Leaderboard!
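The ordering and rank semantics of the commands above can be checked without a Redis server. The sketch below emulates them with a plain Python dict; `zrevrange` and `zrevrank` here are hypothetical local helpers named after the Redis commands, not calls into a Redis client.

```python
def zrevrange(scores, start, stop):
    """Members ordered from highest to lowest score, with inclusive indexes like Redis."""
    ordered = sorted(scores, key=lambda member: (scores[member], member), reverse=True)
    if start < 0:
        start += len(ordered)
    if stop < 0:
        stop += len(ordered)
    return ordered[max(start, 0):stop + 1]


def zrevrank(scores, member):
    """Zero-based position of a member when ordered from highest to lowest score."""
    if member not in scores:
        return None
    return zrevrange(scores, 0, -1).index(member)


# A dict keeps one score per member, mirroring the uniqueness of a sorted set.
leaderboard = {}
for score, member in [(1201, "Gollum"), (963, "Sauron"), (1092, "Bilbo"), (1383, "Frodo")]:
    leaderboard[member] = score

top = zrevrange(leaderboard, 0, -1)
sauron_rank = zrevrank(leaderboard, "Sauron")
```

Ranks are zero-based, so last place out of four players is rank 3, and a "top 10" query uses indexes 0 through 9.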

Page 27: (DAT407) Amazon ElastiCache: Deep Dive

Ex: Throttling requests to an API

Leverages Redis Counters

ELB

Externally

Facing

API

Reference: http://redis.io/commands/INCR

FUNCTION LIMIT_API_CALL(APIaccesskey)
    limit = HGET(APIaccesskey, "limit")
    time = CURRENT_UNIX_TIME()
    keyname = APIaccesskey + ":" + time
    count = GET(keyname)
    IF count != NULL && count > limit THEN
        ERROR "API request limit exceeded"
    ELSE
        MULTI
            INCR(keyname)
            EXPIRE(keyname, 10)
        EXEC
        PERFORM_API_CALL()
    END

Rate Limiting
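A runnable version of the per-second counter pattern, as a sketch only. `FakeCounterStore` is a hypothetical in-memory stand-in for the Redis GET, INCR and EXPIRE calls so the logic can be exercised without a server. One deliberate difference from the pseudocode: the comparison is `>=`, so exactly `limit` calls pass per window.

```python
import time


class FakeCounterStore:
    """In-memory stand-in for Redis GET / INCR / EXPIRE (hypothetical helper)."""

    def __init__(self):
        self._values = {}  # key -> (count, expires_at)

    def get(self, key, now):
        entry = self._values.get(key)
        if entry is None or entry[1] <= now:
            return None  # missing or expired
        return entry[0]

    def incr_with_expiry(self, key, ttl, now):
        # Plays the role of MULTI / INCR / EXPIRE / EXEC in the pseudocode.
        count = (self.get(key, now) or 0) + 1
        self._values[key] = (count, now + ttl)
        return count


def limit_api_call(store, access_key, limit, now=None):
    """Return True if the call is allowed in the current one-second window."""
    now = time.time() if now is None else now
    keyname = f"{access_key}:{int(now)}"  # one counter per key per second
    count = store.get(keyname, now)
    if count is not None and count >= limit:
        return False
    store.incr_with_expiry(keyname, ttl=10, now=now)
    return True


store = FakeCounterStore()
results = [limit_api_call(store, "key-abc", limit=3, now=1000.0) for _ in range(5)]
next_second = limit_api_call(store, "key-abc", limit=3, now=1001.0)
```

Because the window is part of the key name, counters never need resetting; old keys simply expire.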

Page 28: (DAT407) Amazon ElastiCache: Deep Dive

• Redis counters – increment likes/dislikes

• Redis hashes – list of everyone’s ratings

• Process with an algorithm such as Slope One or Jaccard similarity

• Ruby example - https://github.com/davidcelis/recommendable

Recommendation Engines

INCR item:38927:likes
HSET item:38927:ratings "Susan" 1

INCR item:38927:dislikes
HSET item:38927:ratings "Tommy" -1
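As an illustration of the similarity step, here is a minimal Jaccard similarity sketch over two users' sets of liked items. The item IDs other than 38927 are made up for the example, and this is not how the linked Ruby gem is implemented.

```python
def jaccard_similarity(likes_a, likes_b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(likes_a), set(likes_b)
    union = a | b
    if not union:
        return 0.0  # two users with no likes share nothing measurable
    return len(a & b) / len(union)


# Hypothetical liked-item sets for two users.
susan_likes = {"item:38927", "item:100", "item:200"}
tommy_likes = {"item:100", "item:200", "item:300"}

score = jaccard_similarity(susan_likes, tommy_likes)  # 2 shared / 4 distinct items
```

Users with a high score are treated as similar, so items one of them liked become candidates to recommend to the other.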

Page 29: (DAT407) Amazon ElastiCache: Deep Dive

Chat and Messaging

• PUBLISH and SUBSCRIBE Redis commands

• Game or Mobile chat

• Server intercommunication

SUBSCRIBE chat_channel:114
PUBLISH chat_channel:114 "Hello all"
["message", "chat_channel:114", "Hello all"]
UNSUBSCRIBE chat_channel:114

Page 30: (DAT407) Amazon ElastiCache: Deep Dive

ElastiCache with Redis – Development

Region

Availability Zone A Availability Zone B

Auto Scaling group

ElastiCache cluster

Page 31: (DAT407) Amazon ElastiCache: Deep Dive

Availability Zone A Availability Zone B

Use Primary Endpoint

Use Read Replicas

Auto-Failover

Chooses replica with

lowest replication lag

DNS endpoint is same

Redis Multi-AZ

Page 32: (DAT407) Amazon ElastiCache: Deep Dive

ElastiCache with Redis Multi-AZ

Region

Availability Zone A Availability Zone B

Auto Scaling group

ElastiCache cluster

Page 33: (DAT407) Amazon ElastiCache: Deep Dive

ElastiCache with Redis Multi-AZ

Region

Availability Zone A Availability Zone B

Auto Scaling group

ElastiCache cluster

Page 34: (DAT407) Amazon ElastiCache: Deep Dive

ElastiCache with Redis Multi-AZ

Region

Availability Zone A Availability Zone B

Auto Scaling group

ElastiCache cluster

Page 35: (DAT407) Amazon ElastiCache: Deep Dive

ElastiCache with Redis Multi-AZ

Region

Availability Zone A Availability Zone B

Auto Scaling group

ElastiCache cluster

Page 36: (DAT407) Amazon ElastiCache: Deep Dive
Page 37: (DAT407) Amazon ElastiCache: Deep Dive
Page 38: (DAT407) Amazon ElastiCache: Deep Dive
Page 39: (DAT407) Amazon ElastiCache: Deep Dive

Redis Multi-AZ – Reads and Writes

ELB App

External APIs

Replication Group

Reads / Writes

Page 40: (DAT407) Amazon ElastiCache: Deep Dive

Redis – Read/Write Connections

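# Ruby example (redis-rb gem)
require 'redis'
require 'redis/distributed'

# Writes go to the primary endpoint
redis_write = Redis.new(
  host: 'mygame-dev.z2vq55.ng.0001.usw2.cache.amazonaws.com')

# Reads are spread across the read replicas
redis_read = Redis::Distributed.new([
  'redis://mygame-dev-002.z2vq55.ng.0001.usw2.cache.amazonaws.com',
  'redis://mygame-dev-003.z2vq55.ng.0001.usw2.cache.amazonaws.com'
])

# ZADD takes the score first, then the member
redis_write.zadd("leaderboard", 1976, "nateware")

# Indexes are inclusive, so 0..9 returns the top 10
top_10 = redis_read.zrevrange("leaderboard", 0, 9)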

Page 41: (DAT407) Amazon ElastiCache: Deep Dive

Recap – Endpoint Autodetection

• Cluster endpoints:

aws elasticache describe-cache-clusters \
    --cache-cluster-id mycluster \
    --show-cache-node-info

• Redis read replica endpoints:

aws elasticache describe-replication-groups \
    --replication-group-id myredisgroup

• Can listen for SNS events: http://bit.ly/elasticache-sns

http://bit.ly/elasticache-whitepaper

Page 42: (DAT407) Amazon ElastiCache: Deep Dive

Splitting Redis By Purpose

ELB App

External APIs

Reads / Writes

Replication Group

Leaderboards

Replication Group

User Profiles

Reads

Page 43: (DAT407) Amazon ElastiCache: Deep Dive

Don’t Plan Ahead!!

1. Start with one Redis Multi-AZ cluster

2. Split as needed

3. Scale read load via replicas

4. Rinse, repeat

Page 44: (DAT407) Amazon ElastiCache: Deep Dive

Tune It Up!

Page 45: (DAT407) Amazon ElastiCache: Deep Dive

Alarms

Monitoring with CloudWatch

• CPU

• Evictions

• Memory

• Swap Usage

• Network In/Out

Page 46: (DAT407) Amazon ElastiCache: Deep Dive

Key ElastiCache CloudWatch Metrics

• CPUUtilization

• Memcached – up to 90% ok

• Redis – divide by cores (ex: 90% / 4 = 22.5%)

• SwapUsage low

• CacheMisses / CacheHits Ratio low / stable

• Evictions near zero

• Exception: Russian doll caching

• CurrConnections stable

• Whitepaper: http://bit.ly/elasticache-whitepaper
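The "divide by cores" rule for Redis follows from Redis being single-threaded: one fully busy core on a multi-core node shows up as only a fraction of node-level CPUUtilization. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def redis_cpu_alarm_threshold(target_percent, cores):
    """Node-level CPUUtilization at which a single Redis core reaches target_percent."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return target_percent / cores


# The slide's example: a 90% target on a 4-core node.
threshold = redis_cpu_alarm_threshold(90, 4)  # 22.5
```

On that node, a CloudWatch alarm set at 90% would never fire for Redis, while one set at 22.5% fires when the single Redis core is about 90% busy.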

Page 47: (DAT407) Amazon ElastiCache: Deep Dive

Scaling Up Redis

1. Snapshot existing cluster to Amazon S3

http://bit.ly/redis-snapshot

2. Spin up new Redis cluster from snapshot

http://bit.ly/redis-seeding

3. Profit!

4. Also good for debugging copy of production data

Page 48: (DAT407) Amazon ElastiCache: Deep Dive

Common Issues

Page 49: (DAT407) Amazon ElastiCache: Deep Dive

DNS Caching – Redis Failover

• Failover requires updating a DNS CNAME

• Can take up to two minutes

• Watch out for app DNS caching – esp. Java!

http://bit.ly/jvm-dns

• No API for triggering Redis failover

• Turn off Multi-AZ temporarily

• Promote replica to primary

• Turn on Multi-AZ

Page 50: (DAT407) Amazon ElastiCache: Deep Dive

1. Forks main Redis process

2. Writes data to disk from child process

3. Continues to accept traffic on main process

4. Any key update causes a copy-on-write

5. Potentially DOUBLES memory usage by Redis

Swapping During Redis Backup (BGSAVE)
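A back-of-the-envelope estimate of the copy-on-write effect described in the steps above. The model is an assumption made for illustration: it treats extra memory as proportional to the fraction of the dataset rewritten while the save runs, which ignores page granularity and Redis overhead.

```python
GIB = 1024 ** 3


def estimated_peak_bytes(dataset_bytes, write_fraction):
    """Rough peak memory during BGSAVE: the dataset plus the copied portion."""
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    return int(dataset_bytes * (1.0 + write_fraction))


def likely_to_swap(dataset_bytes, write_fraction, node_memory_bytes):
    """True if the estimated peak exceeds the memory available on the node."""
    return estimated_peak_bytes(dataset_bytes, write_fraction) > node_memory_bytes


# Worst case: every key is updated during the save, so memory usage doubles.
worst_case = estimated_peak_bytes(10 * GIB, 1.0)

# A 10 GiB dataset on a node with 13 GiB available (hypothetical sizes).
write_heavy = likely_to_swap(10 * GIB, 0.5, 13 * GIB)
read_mostly = likely_to_swap(10 * GIB, 0.1, 13 * GIB)
```

The same dataset fits comfortably when few keys change during the backup but overflows the node under a write-heavy load.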

Page 51: (DAT407) Amazon ElastiCache: Deep Dive

Reduce memory allocated to Redis

• Set reserved-memory field in parameter groups

• Evicts more data from memory

Use larger cache node type

• More expensive

• But no data eviction

Write-heavy apps need extra Redis memory

Swapping During Redis Backup – Solutions

Page 52: (DAT407) Amazon ElastiCache: Deep Dive

Redis reserved-memory Parameter

Page 53: (DAT407) Amazon ElastiCache: Deep Dive

Redis Engine Enhancements

• Only Available in Amazon ElastiCache

• Forkless backups = Lower memory usage

• If enough memory, will still fork (faster)

• Improved replica sync under heavy write loads

• Smoother failovers (PSYNC)

• Two new CloudWatch metrics

• ReplicationBytes: Number of bytes sent from primary node

• SaveInProgress: 1/0 value that indicates if save is running

• Try it today! Redis 2.8.22 or later.

Page 54: (DAT407) Amazon ElastiCache: Deep Dive

Riot Games: ElastiCache in the Wild

Tom Kerr

Page 55: (DAT407) Amazon ElastiCache: Deep Dive
Page 56: (DAT407) Amazon ElastiCache: Deep Dive

LEAGUE OF LEGENDS

Page 57: (DAT407) Amazon ElastiCache: Deep Dive
Page 58: (DAT407) Amazon ElastiCache: Deep Dive
Page 59: (DAT407) Amazon ElastiCache: Deep Dive

APOLLO

Page 60: (DAT407) Amazon ElastiCache: Deep Dive

APOLLO: COMMENTS ANYWHERE

Page 61: (DAT407) Amazon ElastiCache: Deep Dive

APOLLO: COMMENTS ANYWHERE

Page 62: (DAT407) Amazon ElastiCache: Deep Dive

APOLLO: ARCHITECTURE

Page 63: (DAT407) Amazon ElastiCache: Deep Dive

Replication with automatic failover

Replication across availability zones

More snapshots, more often

Page 64: (DAT407) Amazon ElastiCache: Deep Dive
Page 65: (DAT407) Amazon ElastiCache: Deep Dive
Page 66: (DAT407) Amazon ElastiCache: Deep Dive
Page 67: (DAT407) Amazon ElastiCache: Deep Dive
Page 68: (DAT407) Amazon ElastiCache: Deep Dive
Page 69: (DAT407) Amazon ElastiCache: Deep Dive

LESS GOOD

Fun Stuff Deploy Stuff

GOOD

Fun Stuff Deploy Stuff

Page 70: (DAT407) Amazon ElastiCache: Deep Dive

APOLLO

Page 71: (DAT407) Amazon ElastiCache: Deep Dive

LEADERBOARDS

Page 72: (DAT407) Amazon ElastiCache: Deep Dive
Page 73: (DAT407) Amazon ElastiCache: Deep Dive
Page 74: (DAT407) Amazon ElastiCache: Deep Dive

LEADERBOARDS: ARCHITECTURE

Page 75: (DAT407) Amazon ElastiCache: Deep Dive

LEADERBOARDS: DATA STORE

Page 76: (DAT407) Amazon ElastiCache: Deep Dive

US-WEST2:NA:3848433 37

US-WEST2:NA:3848 433 37

http://redis.io/topics/memory-optimization
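The idea from the memory-optimization page linked above is to turn many small top-level keys into fields of a smaller number of hashes. Below is a sketch of the key split only, with a nested dict standing in for Redis HSET and HGET. The three-digit field width and the helper names are assumptions, not details taken from the talk.

```python
def split_key(full_key, field_digits=3):
    """Split 'PREFIX:1234567' into a hash key 'PREFIX:1234' and a field '567'."""
    prefix, _, numeric_id = full_key.rpartition(":")
    bucket, field = numeric_id[:-field_digits], numeric_id[-field_digits:]
    return f"{prefix}:{bucket}", field


def hset_compact(store, full_key, value):
    # Stand-in for: HSET <hash key> <field> <value>
    hash_key, field = split_key(full_key)
    store.setdefault(hash_key, {})[field] = value


def hget_compact(store, full_key):
    # Stand-in for: HGET <hash key> <field>
    hash_key, field = split_key(full_key)
    return store.get(hash_key, {}).get(field)


store = {}
hset_compact(store, "US-WEST2:NA:3848433", 37)
hset_compact(store, "US-WEST2:NA:3848434", 52)  # hypothetical neighbouring entry

value = hget_compact(store, "US-WEST2:NA:3848433")
```

Both entries land in the single hash `US-WEST2:NA:3848`, so up to 1000 logical keys share one top-level key, which Redis can store in its compact small-hash encoding.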

Page 77: (DAT407) Amazon ElastiCache: Deep Dive

LEADERBOARDS

Page 78: (DAT407) Amazon ElastiCache: Deep Dive

BEST PRACTICES

Replicas with automatic failover

Manually snapshot more often

Monitor your replication metrics

Redis hash key trick

Page 79: (DAT407) Amazon ElastiCache: Deep Dive

Thank you!

Nate Wiger, Principal Solutions Architect, AWS

Tom Kerr, Software Engineer, Riot Games

Page 80: (DAT407) Amazon ElastiCache: Deep Dive

Remember to complete

your evaluations!