Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing
Bin Fan (CMU), Dave Andersen (CMU), Michael Kaminsky (Intel Labs)
NSDI 2013
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 1
Goal: Improve Memcached
1. Reduce space overhead (bytes/key)
2. Improve performance (queries/sec)
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 2
Overview • Previous Work: Sharding
• Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11]
• Hotspot? Memory Efficiency?
• Our Approach: Algorithm Engineering • Apply concurrent / space-efficient data structures
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 3
MemC3 vs. Memcached 3x throughput 30% more small key-value items
Memcached Overview
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 4
• A DRAM-based key-value store • GET(key) • SET(key, value)
• LRU eviction for high hit rate
• Typical use: • Speed up webservers • Alleviate db load
Webserver Webserver
Memcachedserver
Webserver
get(x) set(y,"123") get(z)
Database
Memcachedserver
on misson miss
Typical Workloads
• Watch the next talk!
• Often used for small objects (Facebook[Atikoglu12]) – 90% keys < 31 bytes – Some apps only use 2-byte values
• Tens of millions of queries per second for large memcached clusters (Facebook[Nishtala13])
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 5
Small Objects, High Rate
Bin Fan © April 13!!
http://www.pdl.cmu.edu/ 6
Hash table w/ chaining
K V
K V
K V K V
K V
• Key-Value Index: – Chaining hash table
Memcached: Core Data Structures
Memcached: Core Data Structures
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 7
• Key-Value Index: – Chaining hash table
Hash table w/ chaining
K V
K V
K V K V
K V
LRU header
Doubly-linked-list(for each slab)
• LRU Eviction: – Doubly-linked lists
Problems We Solve • Single-node scalability
• Accessing hash table and updating LRU are serialized
• Space overhead • 56-byte header per object
– Including 3 pointers and 1 refcount – For a 100B object, overhead > 50%
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 8
Solutions Optimistic cuckoo hashing
• Better memory efficiency: 95% table occupancy • Higher concurrency: single-writer/multi-reader
CLOCK-based LRU eviction • Better space efficiency and concurrency
Additional algo & tuning improvements
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 9
focus of this talk
described in paper
Solutions Optimistic cuckoo hashing
• Better memory efficiency: 95% table occupancy • Higher concurrency: single-writer/multi-reader
CLOCK-based LRU eviction • Better space efficiency and concurrency
Additional algo & tuning improvements
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 10
focus of this talk
• Chaining items hashed in same bucket:
Good: simple (Data Structure 101) Bad: low cache locality:
(dependent pointer dereference) Bad: pointer costs space
Memcached Default Hash Table
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 11
K V K V K Vlookup
Linear Probing
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 12
• Probing consecutive buckets for vacancy Good: simple Good: cache friendly Bad: poor memory efficiency:
( if occupancy > 50%, lookup needs to search a long chain)
lookup
Cuckoo Hashing[Pagh04] • Each key has two candidate buckets
• Assigned by hash1(key), hash2(key) • Stored in one of its candidate buckets
• Lookup: read 2 buckets in parallel
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 13
hash1(x)
hash2(x)
0!
1!
2!
3!
4!
5!
6!
7!
lookup x
Cuckoo Hashing[Pagh04] • Each key has two candidate buckets
• Assigned by hash1(key), hash2(key) • Stored in one of its candidate buckets
• Lookup: read 2 buckets in parallel
• Insert: • Perform key displacement
recursively • Still O(1) on average [Pagh04]
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 14
0!
1!
2!
3!
4!
5!
6!
7!
x
hash1(x)
a
b hash2(x)
hash1(b) c
hash2(c) insert
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 15
Increase Set-Associativity
01234567
x
b
a
x
01234567
a c d
fe g h
b
• 2 cacheline-sized reads per lookup • 50% space utilized
• 2 cacheline-sized reads per lookup • 95% space utilized!
Each bucket still fits in 1 cacheline
Solutions Optimistic cuckoo hashing
• Better memory efficiency: 95% table occupancy • Higher concurrency: single-writer/multi-reader
CLOCK-based LRU eviction • Better space efficiency and concurrency
Additional algo & tuning improvements
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 16
focus of this talk
False Miss Problem • During insertion:
– always a “floating” item during insertion – a reader may miss
this floating item
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 17
0!
1!
2!
3!
4!
5!
6!
7!
a
Insert x b
Floating Item
Our Solution: 2-Step Insert • Step1: Find a cuckoo path to an empty slot
without editing buckets
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 18
0!
1!
2!
3!
4!
5!
6!
7!
a
Insert x b
Our Solution: 2-Step Insert • Step1: Find a cuckoo path to an empty slot
without editing buckets
• Step2: Move hole backwards:
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 19
0!
1!
2!
3!
4!
5!
6!
7!
a
Insert x b
Our Solution: 2-Step Insert • Step1: Find a cuckoo path to an empty slot
without editing buckets
• Step2: Move hole backwards:
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 20
0!
1!
2!
3!
4!
5!
6!
7!
a
Insert x
b
Our Solution: 2-Step Insert • Step1: Find a cuckoo path to an empty slot
without editing buckets
• Step2: Move hole backwards:
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 21
0!
1!
2!
3!
4!
5!
6!
7!
a Insert x
b
Our Solution: 2-Step Insert • Step1: Find a cuckoo path to an empty slot
without editing buckets
• Step2: Move hole backwards:
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 22
0!
1!
2!
3!
4!
5!
6!
7!
a
x
b
Only need to ensure each move is atomic w.r.t. reader
How to Ensure Atomic Move • e.g., move key “b” from bucket 4 to bucket 2
• A simple implementation:
• Our approach: Optimistic locking • Optimized for read-heavy workloads • Each key mapped to a version counter • Reader detects version change
(described in paper)
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 23
0!
1!
2!
3!
4!
5!
6!
7!
b
Lock bucket 2 and 4 Move key Unlock bucket 2 and 4
How to Coordinate Writers • Simple (current) solution:
• Serialize inserts • Works fine with read-heavy workload
• Ongoing work: allow multiple writers
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 24
Solutions Optimistic cuckoo hashing
• Better memory efficiency: 95% table occupancy • Higher concurrency: single-writer/multi-reader
CLOCK-based LRU eviction • Better space efficiency and concurrency
Additional algo & tuning improvements
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 25
focus of this talk
2ptr/key => 1bit/key, concurrent update
Solutions Optimistic cuckoo hashing
• Better memory efficiency: 95% table occupancy • Higher concurrency: single-writer/multi-reader
CLOCK-based LRU eviction • Better space efficiency and concurrency
Additional algo & tuning improvements
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 26
focus of this talk
2ptr/key => 1bit/key, concurrent update
Avoid unnecessary full-key comparisons on hash collision
Problems We Solve • Single-node scalability
• Accessing hash table and updating LRU are serialized • GET requires no mutex • Single-writer/multiple-reader
• Space overhead
• 56-byte header per object • 3 pointers + 1 refcount => 1 pointer + 1 refbit
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 27
1.74% 1.9%
12.79% 14.28%
21.54%
47.89%
0%
10%
20%
30%
40%
50%
60%
Hash Table Microbenchmark
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 28
Lookups all hit
6 local threads reading hash tables of ~1GB
Base Chaining w/ Bkt Lock
Optimistic Cuckoo
Base Chaining w/ Bkt Lock
Optimistic Cuckoo
Lookups all miss
Server: Low Power Xeon CPU w/ 12 cores, 12 MB L3 cache
Mill
ion
Look
ups/
sec
235%
68%
1.74% 1.9%
12.79% 14.28%
21.54%
47.89%
0%
10%
20%
30%
40%
50%
60%
Hash Table Microbenchmark
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 29
Lookups all hit
6 local threads reading hash tables of ~1GB
Base Chaining w/ Bkt Lock
Optimistic Cuckoo
Base Chaining w/ Bkt Lock
Optimistic Cuckoo
Lookups all miss
Server: Low Power Xeon CPU w/ 12 cores, 12 MB L3 cache
Mill
ion
Look
ups/
sec
235%
68%
0"
1"
2"
3"
4"
5"
1" 2" 4" 6" 8" 10" 12" 14" 16"
MQPS%
Number%of%Server%Threads%%
End-to-end Performance
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 30
50 remote clients generate workloads
MemC3
Sharding
Memcached
16-Byte key, 32-Byte Value, 95% GET, 5% SET, zipf distributed
max tput 4.3 MOPS
max tput 1.5 MOPS
max tput 0.6 MOPS
Conclusion • Optimistic cuckoo hashing
• High space efficiency • Optimized for read-heavy workloads • Source Code available:
github.com/efficient/libcuckoo
• MemC3 improves Memcached • 3x throughput, 30% more (small) objects • Optimistic Cuckoo Hashing, CLOCK, other system
tuning
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 31
References [Atikoglu12] Workload analysis of a large-scale key- value store. [Berezecki11] Many-core key-value store [Nishtala13] Scaling Memcache at Facebook
[Pagh04] Cuckoo hashing
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 32
Q & A • Thanks!
Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 33