© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Nate Wiger, Principal Solutions Architect, AWS
Tom Kerr, Software Engineer, Riot Games
October 8, 2015
Amazon ElastiCache Deep Dive: Scaling Your Data in a Real-Time World
DAT407
Amazon ElastiCache
• Managed in-memory service
• Memcached or Redis
• Cluster of nodes
• Read replicas
• Monitoring + alerts
ELB App
External APIs
Modern Web / Mobile App
Memcached vs Redis
Memcached
• Flat string cache
• Multithreaded
• No persistence
• Low maintenance
• Easy to scale horizontally
Redis
• Single-threaded
• Persistence
• Atomic operations
• Advanced data types - http://redis.io/topics/data-types
• Pub/sub messaging
• Read replicas / failover
Storing JSON – Memcached vs Redis
# Memcached: Serialize string
str_json = Encode({"name": "Nate Wiger", "gender": "M"})
SET user:nateware str_json
GET user:nateware
json = Decode(str_json)
# Redis: Use a hash!
HMSET user:nateware name "Nate Wiger" gender M
HGET user:nateware name
>> Nate Wiger
HMGET user:nateware name gender
>> Nate Wiger
>> M
ElastiCache with Memcached
ElastiCache with Memcached – Development
[Diagram: a Region with Availability Zone A and Availability Zone B, an Auto Scaling group, and an ElastiCache cluster]
Nope
Add Nodes to Memcached Cluster
aws elasticache modify-cache-cluster \
    --cache-cluster-id my-cache-cluster \
    --num-cache-nodes 4 \
    --apply-immediately
# response
"CacheClusterStatus": "modifying",
"PendingModifiedValues": {
    "NumCacheNodes": 4
},
ElastiCache with Memcached – High Availability
[Diagram: a Region with Availability Zone A and Availability Zone B, an Auto Scaling group, and an ElastiCache cluster]
ElastiCache with Memcached – Scale Out
[Diagram: a Region with Availability Zone A and Availability Zone B, an Auto Scaling group, and an ElastiCache cluster]
Sharding
Consistent Hashing
Client pre-calculates a hash ring for best key distribution
http://berb.github.io/diploma-thesis/original/062_internals.html
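A minimal sketch of the hash-ring idea, assuming MD5 as the hash and 100 virtual points per node. The `HashRing` class and the node names are hypothetical; the client libraries listed in this deck ship their own, more complete implementations.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas      # virtual points per node (assumed value)
        self._ring = {}               # ring position -> node name
        self._positions = []          # sorted ring positions
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        # Map a string to a large integer position on the ring.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Place several virtual points per node to even out key distribution.
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            self._ring[pos] = node
            bisect.insort(self._positions, pos)

    def get_node(self, key):
        # Walk clockwise to the first point at or after the key's position,
        # wrapping around to the start of the ring if needed.
        pos = self._hash(key)
        idx = bisect.bisect(self._positions, pos) % len(self._positions)
        return self._ring[self._positions[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.get_node("user:nateware")
```

When a node is added, only the keys that now fall just before one of its points change owner; every other key stays on its existing node, which is what keeps the cache mostly warm during a scale-out.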
It’s All Been Done Before
• Ruby
  • Dalli https://github.com/mperham/dalli
  • Plus ElastiCache https://github.com/ktheory/dalli-elasticache
• Python
  • HashRing / MemcacheRing https://pypi.python.org/pypi/hash_ring/
  • Django w/ Auto-Discovery https://github.com/gusdan/django-elasticache
• Node.js
  • node-memcached https://github.com/3rd-Eden/node-memcached
  • Auto-Discovery example http://stackoverflow.com/questions/17046661
• Java
  • SpyMemcached https://github.com/dustin/java-memcached-client
  • ElastiCache Client https://github.com/amazonwebservices/aws-elasticache-cluster-client-memcached-for-java
• PHP
  • ElastiCache Client https://github.com/awslabs/aws-elasticache-cluster-client-memcached-for-php
• .NET
  • ElastiCache Client https://github.com/awslabs/elasticache-cluster-config-net
Auto-Discovery Endpoint
# PHP
$server_endpoint = "mycache.z2vq55.cfg.usw2.cache.amazonaws.com";
$cache = new Memcached();
$cache->setOption(
Memcached::OPT_CLIENT_MODE, Memcached::DYNAMIC_CLIENT_MODE);
# Set config endpoint as only server
$cache->addServer($server_endpoint, 11211);
DIY: http://bit.ly/elasticache-autodisc
Memcached Node Auto-Discovery
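For the DIY route, a rough sketch of parsing the reply a configuration endpoint sends to the `config get cluster` command. The reply layout below (header line, config version, then space-separated `hostname|ip|port` entries) is my understanding of the AWS Auto Discovery documentation; the function name, node names, and IP addresses are made up for illustration.

```python
def parse_cluster_config(payload):
    """Parse a `config get cluster` reply into (version, [(host, ip, port)])."""
    lines = [line.strip() for line in payload.splitlines() if line.strip()]
    # Expected layout (assumption): "CONFIG cluster 0 <length>", the config
    # version, one line of space-separated node entries, then "END".
    if len(lines) < 3 or not lines[0].startswith("CONFIG cluster"):
        raise ValueError("unexpected config reply")
    version = int(lines[1])
    nodes = []
    for entry in lines[2].split():
        hostname, ip, port = entry.split("|")
        nodes.append((hostname, ip, int(port)))
    return version, nodes


# Hypothetical reply for a two-node cluster.
sample = (
    "CONFIG cluster 0 147\r\n"
    "12\n"
    "mycache.z2vq55.0001.usw2.cache.amazonaws.com|10.0.0.1|11211 "
    "mycache.z2vq55.0002.usw2.cache.amazonaws.com|10.0.0.2|11211\n\r\n"
    "END\r\n"
)
version, nodes = parse_cluster_config(sample)
```

A client would re-run this periodically and rebuild its server list whenever the version number increases.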
App Caching Patterns
Be Lazy
# Python
def get_user(user_id):
    record = cache.get(user_id)
    if record is None:
        # Run a DB query
        record = db.query("select * from users where id = ?", user_id)
        cache.set(user_id, record)
    return record

# App code
user = get_user(17)
Write On Through
# Python
def save_user(user_id, values):
    record = db.query("update users ... where id = ?", user_id, values)
    cache.set(user_id, record)
    return record

# App code
user = save_user(17, {"name": "Nate Dogg"})
Combo Move!
def save_user(user_id, values):
    record = db.query("update users ... where id = ?", user_id, values)
    cache.set(user_id, record, 300)  # TTL
    return record

def get_user(user_id):
    record = cache.get(user_id)
    if record is None:
        record = db.query("select * from users where id = ?", user_id)
        cache.set(user_id, record, 300)  # TTL
    return record

# App code
save_user(17, {"name": "Nate Diddy"})
user = get_user(17)
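The snippets above rely on `cache` and `db` objects that the slides do not define. Below is a self-contained sketch of the same lazy-load plus write-through combination, with an in-process `TTLCache` standing in for Memcached and a plain dict standing in for the database. Both stand-ins, and the `UserStore` class, are hypothetical helpers for illustration.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a Memcached client (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: behave like a cache miss
            return None
        return value

    def set(self, key, value, ttl):
        self._items[key] = (self._clock() + ttl, value)


class UserStore:
    """Lazy loading on reads, write-through on writes, TTL on both."""

    def __init__(self, cache, db, ttl=300):
        self.cache, self.db, self.ttl = cache, db, ttl
        self.db_reads = 0  # counts how often reads fall through to the DB

    def get_user(self, user_id):
        record = self.cache.get(user_id)
        if record is None:  # cache miss: load lazily from the database
            self.db_reads += 1
            record = self.db.get(user_id)
            if record is not None:
                self.cache.set(user_id, record, self.ttl)
        return record

    def save_user(self, user_id, values):
        record = {**self.db.get(user_id, {}), **values}
        self.db[user_id] = record                  # update the database first
        self.cache.set(user_id, record, self.ttl)  # then write through
        return record
```

The TTL acts as a safety net: even if a write path forgets to update the cache, a stale entry disappears after at most `ttl` seconds.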
Web Cache with Memcached
# Gemfile
gem 'dalli-elasticache'
# config/environments/production.rb
endpoint = "mycluster.abc123.cfg.use1.cache.amazonaws.com:11211"
elasticache = Dalli::ElastiCache.new(endpoint)
config.cache_store = :dalli_store, elasticache.servers,
  expires_in: 1.day, compress: true
# if you change ElastiCache cluster nodes
elasticache.refresh.client
Ruby on Rails Example
Thundering Herd
Causes
• Cold cache – app startup
• Adding / removing nodes
• Cache key expiration (TTL)
• Out of cache memory
Large # of cache misses
Spike in database load
Mitigations
• Script to populate cache
• Gradually scale nodes
• Randomize TTL values
• Monitor cache utilization
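One way to randomize TTL values, sketched under the assumption that a spread of plus or minus 10% around the base TTL is acceptable. The `jittered_ttl` helper and its default spread are illustrative, not a recommendation from the talk.

```python
import random


def jittered_ttl(base_ttl, spread=0.1, rng=random):
    """Return base_ttl adjusted by a random factor within +/- spread."""
    if not 0 <= spread < 1:
        raise ValueError("spread must be in [0, 1)")
    low = base_ttl * (1 - spread)
    high = base_ttl * (1 + spread)
    return int(round(rng.uniform(low, high)))


# Keys written at the same moment now expire at slightly different times,
# e.g. cache.set(user_id, record, jittered_ttl(300))
ttl = jittered_ttl(300)
```

Spreading expirations this way keeps a batch of keys cached together from all missing, and all hitting the database, in the same second.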
ElastiCache with Redis
Need uniqueness + ordering
Easy with Redis Sorted Sets
ZADD "leaderboard" 1201 "Gollum"
ZADD "leaderboard" 963 "Sauron"
ZADD "leaderboard" 1092 "Bilbo"
ZADD "leaderboard" 1383 "Frodo"
ZREVRANGE "leaderboard" 0 -1
1) "Frodo"
2) "Gollum"
3) "Bilbo"
4) "Sauron"
ZREVRANK "leaderboard" "Sauron"
(integer) 3
Real-time Leaderboard!
Ex: Throttling requests to an API
Leverages Redis Counters
[Diagram: ELB in front of an externally facing API]
Reference: http://redis.io/commands/INCR
FUNCTION LIMIT_API_CALL(APIaccesskey)
  limit = HGET(APIaccesskey, "limit")
  time = CURRENT_UNIX_TIME()
  keyname = APIaccesskey + ":" + time
  count = GET(keyname)
  IF count != NULL && count > limit THEN
    ERROR "API request limit exceeded"
  ELSE
    MULTI
      INCR(keyname)
      EXPIRE(keyname, 10)
    EXEC
    PERFORM_API_CALL()
  END
Rate Limiting
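The per-second counter in the pseudocode can be tried out without a Redis server. This sketch uses an in-memory dict in place of Redis (so there is no MULTI/EXEC and no cross-process sharing), and the `FixedWindowLimiter` name is hypothetical. Unlike the pseudocode, it rejects once the count reaches the limit, so exactly `limit` calls pass per second.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` calls per API key per one-second window."""

    def __init__(self, limit, clock=time.time):
        self.limit = limit
        self._clock = clock
        self._counts = {}  # (api_key, second) -> request count

    def allow(self, api_key):
        second = int(self._clock())
        # Drop counters from earlier seconds; Redis would do this via EXPIRE.
        self._counts = {k: v for k, v in self._counts.items() if k[1] == second}
        key = (api_key, second)  # mirrors keyname = APIaccesskey + ":" + time
        count = self._counts.get(key, 0)
        if count >= self.limit:
            return False  # caller should report "API request limit exceeded"
        self._counts[key] = count + 1  # mirrors INCR(keyname)
        return True


limiter = FixedWindowLimiter(limit=3)
if limiter.allow("key-123"):
    pass  # perform the API call here
```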
• Redis counters – increment likes/dislikes
• Redis hashes – list of everyone’s ratings
• Process with algorithm like Slope One or Jaccardian similarity
• Ruby example - https://github.com/davidcelis/recommendable
Recommendation Engines
INCR item:38927:likes
HSET item:38927:ratings "Susan" 1
INCR item:38927:dislikes
HSET item:38927:ratings "Tommy" -1
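As a small illustration of the similarity step, here is Jaccard similarity computed over the sets of items two users liked. How the sets are pulled out of Redis is left open; the item IDs other than 38927 are made up, and `jaccard_similarity` is a hypothetical helper rather than part of any library named above.

```python
def jaccard_similarity(likes_a, likes_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of liked item IDs."""
    a, b = set(likes_a), set(likes_b)
    union = a | b
    if not union:
        return 0.0  # two users with no likes share nothing measurable
    return len(a & b) / len(union)


susan = {"38927", "40112"}   # hypothetical liked items
tommy = {"38927", "51377"}
score = jaccard_similarity(susan, tommy)  # one shared item out of three
```

Users with a high score have similar taste, so items liked by one and not yet seen by the other are natural recommendations.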
Chat and Messaging
• PUBLISH and SUBSCRIBE Redis commands
• Game or Mobile chat
• Server intercommunication
SUBSCRIBE chat_channel:114
PUBLISH chat_channel:114 "Hello all"
["message", "chat_channel:114", "Hello all"]
UNSUBSCRIBE chat_channel:114
ElastiCache with Redis – Development
[Diagram: a Region with Availability Zone A and Availability Zone B, an Auto Scaling group, and an ElastiCache cluster]
Availability Zone A Availability Zone B
Use Primary Endpoint
Use Read Replicas
Auto-Failover
Chooses replica with lowest replication lag
DNS endpoint stays the same
Redis Multi-AZ
ElastiCache with Redis Multi-AZ
[Diagram: a Region with Availability Zone A and Availability Zone B, an Auto Scaling group, and an ElastiCache cluster]
Redis Multi-AZ – Reads and Writes
[Diagram: ELB, App, External APIs, and a Replication Group, with arrows labeled Reads and Writes]
Redis – Read/Write Connections
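# Ruby example
require 'redis'
require 'redis/distributed'

# Writes go to the primary endpoint
redis_write = Redis.new(
  host: 'mygame-dev.z2vq55.ng.0001.usw2.cache.amazonaws.com')
# Reads are spread across the replica endpoints
redis_read = Redis::Distributed.new([
  'redis://mygame-dev-002.z2vq55.ng.0001.usw2.cache.amazonaws.com',
  'redis://mygame-dev-003.z2vq55.ng.0001.usw2.cache.amazonaws.com'
])
# ZADD takes the score before the member
redis_write.zadd("leaderboard", 1976, "nateware")
# Ranks 0 through 9 are the top 10
top_10 = redis_read.zrevrange("leaderboard", 0, 9)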
Recap – Endpoint Autodetection
• Cluster endpoints:
aws elasticache describe-cache-clusters \
    --cache-cluster-id mycluster \
    --show-cache-node-info
• Redis read replica endpoints:
aws elasticache describe-replication-groups \
    --replication-group-id myredisgroup
• Can listen for SNS events: http://bit.ly/elasticache-sns
http://bit.ly/elasticache-whitepaper
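The same cluster lookup can be done from application code. A sketch using boto3's `describe_cache_clusters` call with `ShowCacheNodeInfo=True`; the helper names are hypothetical, and the sample response below is trimmed to only the fields the parser reads.

```python
def node_endpoints(response):
    """Collect (address, port) pairs from a describe-cache-clusters response."""
    endpoints = []
    for cluster in response.get("CacheClusters", []):
        for node in cluster.get("CacheNodes", []):
            endpoint = node.get("Endpoint")
            if endpoint:  # nodes still being created may lack an endpoint
                endpoints.append((endpoint["Address"], endpoint["Port"]))
    return endpoints


def fetch_node_endpoints(cluster_id):
    """Call the ElastiCache API; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so node_endpoints has no dependencies

    client = boto3.client("elasticache")
    response = client.describe_cache_clusters(
        CacheClusterId=cluster_id, ShowCacheNodeInfo=True
    )
    return node_endpoints(response)


# Trimmed, made-up response used to exercise the parser offline.
sample_response = {
    "CacheClusters": [
        {
            "CacheClusterId": "mycluster",
            "CacheNodes": [
                {"Endpoint": {"Address": "mycluster.abc123.0001.use1.cache.amazonaws.com", "Port": 11211}},
                {"Endpoint": {"Address": "mycluster.abc123.0002.use1.cache.amazonaws.com", "Port": 11211}},
            ],
        }
    ]
}
endpoints = node_endpoints(sample_response)
```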
Splitting Redis By Purpose
[Diagram: ELB, App, and External APIs in front of two Replication Groups, one for Leaderboards and one for User Profiles, with arrows labeled Reads and Writes]
Don’t Plan Ahead!!
1. Start with one Redis Multi-AZ cluster
2. Split as needed
3. Scale read load via replicas
4. Rinse, repeat
Tune It Up!
Alarms
Monitoring with CloudWatch
• CPU
• Evictions
• Memory
• Swap Usage
• Network In/Out
Key ElastiCache CloudWatch Metrics
• CPUUtilization
• Memcached – up to 90% ok
• Redis – divide by cores (ex: 90% / 4 = 22.5%)
• SwapUsage low
• CacheMisses / CacheHits Ratio low / stable
• Evictions near zero
• Exception: Russian doll caching
• CurrConnections stable
• Whitepaper: http://bit.ly/elasticache-whitepaper
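The "divide by cores" rule for Redis is simple arithmetic: Redis is single-threaded, so one saturated core shows up as only a fraction of the node's total CPUUtilization. A worked version of the slide's example, with a hypothetical helper name and the 90% target taken from the bullet above:

```python
def redis_cpu_alarm_threshold(cores, target_percent=90.0):
    """CPUUtilization value at which one core is about target_percent busy."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return target_percent / cores


threshold = redis_cpu_alarm_threshold(4)  # 90% / 4 cores = 22.5
```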
Scaling Up Redis
1. Snapshot existing cluster to Amazon S3
http://bit.ly/redis-snapshot
2. Spin up new Redis cluster from snapshot
http://bit.ly/redis-seeding
3. Profit!
4. Also good for debugging against a copy of production data
Common Issues
DNS Caching – Redis Failover
• Failover requires updating a DNS CNAME
• Can take up to two minutes
• Watch out for app DNS caching – esp. Java!
http://bit.ly/jvm-dns
• No API for triggering Redis failover
  • Turn off Multi-AZ temporarily
  • Promote replica to primary
  • Turn on Multi-AZ
1. Forks main Redis process
2. Writes data to disk from child process
3. Continues to accept traffic on main process
4. Any key update causes a copy-on-write
5. Potentially DOUBLES memory usage by Redis
Swapping During Redis Backup (BGSAVE)
Reduce memory allocated to Redis
• Set reserved-memory field in parameter groups
• Evicts more data from memory
Use larger cache node type
• More expensive
• But no data eviction
Write-heavy apps need extra Redis memory
Swapping During Redis Backup – Solutions
Redis reserved-memory Parameter
Redis Engine Enhancements
• Only Available in Amazon ElastiCache
• Forkless backups = Lower memory usage
• If enough memory, will still fork (faster)
• Improved replica sync under heavy write loads
• Smoother failovers (PSYNC)
• Two new CloudWatch metrics
• ReplicationBytes: Number of bytes sent from primary node
• SaveInProgress: 1/0 value that indicates if save is running
• Try it today! Redis 2.8.22 or later.
Riot Games: ElastiCache in the Wild
Tom Kerr
LEAGUE OF LEGENDS
APOLLO
APOLLO: COMMENTS ANYWHERE
APOLLO: ARCHITECTURE
Replication with automatic failover
Replication across availability zones
More snapshots, more often
LESS GOOD
Fun Stuff Deploy Stuff
GOOD
Fun Stuff Deploy Stuff
APOLLO
LEADERBOARDS
LEADERBOARDS: ARCHITECTURE
LEADERBOARDS: DATA STORE
US-WEST2:NA:3848433 37
US-WEST2:NA:3848 37433
http://redis.io/topics/memory-optimization
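The linked memory-optimization page describes storing many small values as fields of a hash instead of as separate top-level keys. One reading of the example keys above is that the last three digits of the ID become the hash field. The sketch below follows that reading, which is an assumption: the split length, the helper names, and the dict-based stand-in for Redis hashes are all illustrative.

```python
def split_key(key, field_len=3):
    """Split a long key into (hash key, field) at its last field_len chars."""
    if len(key) <= field_len:
        raise ValueError("key too short to split")
    return key[:-field_len], key[-field_len:]


class HashBucketStore:
    """Dict-of-dicts stand-in for Redis hashes (HSET / HGET)."""

    def __init__(self):
        self._hashes = {}

    def set(self, key, value):
        bucket, field = split_key(key)
        # Equivalent in spirit to: HSET bucket field value
        self._hashes.setdefault(bucket, {})[field] = value

    def get(self, key):
        bucket, field = split_key(key)
        # Equivalent in spirit to: HGET bucket field
        return self._hashes.get(bucket, {}).get(field)


store = HashBucketStore()
store.set("US-WEST2:NA:3848433", 37)
value = store.get("US-WEST2:NA:3848433")
```

With three digits per field, each hash holds at most 1,000 entries, small enough for Redis to keep it in its compact encoding as long as the hash size limits in the Redis configuration allow it.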
LEADERBOARDS
BEST PRACTICES
Replicas with automatic failover
Manually snapshot more often
Monitor your replication metrics
Redis hash key trick
Thank you!
Nate Wiger, Principal Solutions Architect, AWS
Tom Kerr, Software Engineer, Riot Games