33
MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing Bin Fan (CMU), Dave Andersen (CMU), Michael Kaminsky (Intel Labs) NSDI 2013 Bin Fan @ NSDI 2013 http://www.pdl.cmu.edu/ 1

MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing

Bin Fan (CMU), Dave Andersen (CMU), Michael Kaminsky (Intel Labs)

NSDI 2013

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 1

Page 2: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Goal: Improve Memcached

1. Reduce space overhead (bytes/key)

2. Improve performance (queries/sec)

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 2

Page 3: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Overview •  Previous Work: Sharding

•  Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11]

•  Hotspot? Memory Efficiency?

•  Our Approach: Algorithm Engineering •  Apply concurrent / space-efficient data structures

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 3

MemC3 vs. Memcached 3x throughput 30% more small key-value items

Page 4: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Memcached Overview

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 4

•  A DRAM-based key-value store •  GET(key) •  SET(key, value)

•  LRU eviction for high hit rate

•  Typical use: •  Speed up webservers •  Alleviate db load

Webserver Webserver

Memcachedserver

Webserver

get(x) set(y,"123") get(z)

Database

Memcachedserver

on misson miss

Page 5: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Typical Workloads

•  Watch the next talk!

•  Often used for small objects (Facebook[Atikoglu12]) – 90% keys < 31 bytes – Some apps only use 2-byte values

•  Tens of millions of queries per second for large memcached clusters (Facebook[Nishtala13])

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 5

Small Objects, High Rate

Page 6: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Bin Fan © April 13!!

http://www.pdl.cmu.edu/ 6

Hash table w/ chaining

K V

K V

K V K V

K V

•  Key-Value Index: –  Chaining hash table

Memcached: Core Data Structures

Page 7: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Memcached: Core Data Structures

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 7

•  Key-Value Index: –  Chaining hash table

Hash table w/ chaining

K V

K V

K V K V

K V

LRU header

Doubly-linked-list(for each slab)

•  LRU Eviction: –  Doubly-linked lists

Page 8: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Problems We Solve •  Single-node scalability

•  Accessing hash table and updating LRU are serialized

•  Space overhead •  56-byte header per object

– Including 3 pointers and 1 refcount – For a 100B object, overhead > 50%

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 8

Page 9: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Solutions Optimistic cuckoo hashing

•  Better memory efficiency: 95% table occupancy •  Higher concurrency: single-writer/multi-reader

CLOCK-based LRU eviction •  Better space efficiency and concurrency

Additional algo & tuning improvements

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 9

focus of this talk

described in paper

Page 10: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Solutions Optimistic cuckoo hashing

•  Better memory efficiency: 95% table occupancy •  Higher concurrency: single-writer/multi-reader

CLOCK-based LRU eviction •  Better space efficiency and concurrency

Additional algo & tuning improvements

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 10

focus of this talk

Page 11: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

•  Chaining items hashed in same bucket:

Good: simple (Data Structure 101) Bad: low cache locality:

(dependent pointer dereference) Bad: pointer costs space

Memcached Default Hash Table

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 11

K V K V K Vlookup

Page 12: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Linear Probing

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 12

•  Probing consecutive buckets for vacancy Good: simple Good: cache friendly Bad: poor memory efficiency:

( if occupancy > 50%, lookup needs to search a long chain)

lookup

Page 13: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Cuckoo Hashing[Pagh04] •  Each key has two candidate buckets

•  Assigned by hash1(key), hash2(key) •  Stored in one of its candidate buckets

•  Lookup: read 2 buckets in parallel

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 13

hash1(x)

hash2(x)

0!

1!

2!

3!

4!

5!

6!

7!

lookup x

Page 14: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Cuckoo Hashing[Pagh04] •  Each key has two candidate buckets

•  Assigned by hash1(key), hash2(key) •  Stored in one of its candidate buckets

•  Lookup: read 2 buckets in parallel

•  Insert: •  Perform key displacement

recursively •  Still O(1) on average [Pagh04]

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 14

0!

1!

2!

3!

4!

5!

6!

7!

x

hash1(x)

a

b hash2(x)

hash1(b) c

hash2(c) insert

Page 15: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 15

Increase Set-Associativity

01234567

x

b

a

x

01234567

a c d

fe g h

b

•  2 cacheline-sized reads per lookup •  50% space utilized

•  2 cacheline-sized reads per lookup •  95% space utilized!

Each bucket still fits in 1 cacheline

Page 16: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Solutions Optimistic cuckoo hashing

•  Better memory efficiency: 95% table occupancy •  Higher concurrency: single-writer/multi-reader

CLOCK-based LRU eviction •  Better space efficiency and concurrency

Additional algo & tuning improvements

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 16

focus of this talk

Page 17: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

False Miss Problem •  During insertion:

–  always a “floating” item during insertion –  a reader may miss

this floating item

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 17

0!

1!

2!

3!

4!

5!

6!

7!

a

Insert x b

Floating Item

Page 18: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Our Solution: 2-Step Insert •  Step1: Find a cuckoo path to an empty slot

without editing buckets

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 18

0!

1!

2!

3!

4!

5!

6!

7!

a

Insert x b

Page 19: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Our Solution: 2-Step Insert •  Step1: Find a cuckoo path to an empty slot

without editing buckets

•  Step2: Move hole backwards:

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 19

0!

1!

2!

3!

4!

5!

6!

7!

a

Insert x b

Page 20: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Our Solution: 2-Step Insert •  Step1: Find a cuckoo path to an empty slot

without editing buckets

•  Step2: Move hole backwards:

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 20

0!

1!

2!

3!

4!

5!

6!

7!

a

Insert x

b

Page 21: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Our Solution: 2-Step Insert •  Step1: Find a cuckoo path to an empty slot

without editing buckets

•  Step2: Move hole backwards:

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 21

0!

1!

2!

3!

4!

5!

6!

7!

a Insert x

b

Page 22: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Our Solution: 2-Step Insert •  Step1: Find a cuckoo path to an empty slot

without editing buckets

•  Step2: Move hole backwards:

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 22

0!

1!

2!

3!

4!

5!

6!

7!

a

x

b

Only need to ensure each move is atomic w.r.t. reader

Page 23: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

How to Ensure Atomic Move •  e.g., move key “b” from bucket 4 to bucket 2

•  A simple implementation:

•  Our approach: Optimistic locking •  Optimized for read-heavy workloads •  Each key mapped to a version counter •  Reader detects version change

(described in paper)

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 23

0!

1!

2!

3!

4!

5!

6!

7!

b

Lock bucket 2 and 4 Move key Unlock bucket 2 and 4

Page 24: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

How to Coordinate Writers •  Simple (current) solution:

•  Serialize inserts •  Works fine with read-heavy workload

•  Ongoing work: allow multiple writers

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 24

Page 25: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Solutions Optimistic cuckoo hashing

•  Better memory efficiency: 95% table occupancy •  Higher concurrency: single-writer/multi-reader

CLOCK-based LRU eviction •  Better space efficiency and concurrency

Additional algo & tuning improvements

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 25

focus of this talk

2ptr/key => 1bit/key, concurrent update

Page 26: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Solutions Optimistic cuckoo hashing

•  Better memory efficiency: 95% table occupancy •  Higher concurrency: single-writer/multi-reader

CLOCK-based LRU eviction •  Better space efficiency and concurrency

Additional algo & tuning improvements

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 26

focus of this talk

2ptr/key => 1bit/key, concurrent update

Avoid unnecessary full-key comparisons on hash collision

Page 27: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Problems We Solve •  Single-node scalability

•  Accessing hash table and updating LRU are serialized •  GET requires no mutex •  Single-writer/multiple-reader

•  Space overhead

•  56-byte header per object •  3 pointers + 1 refcount => 1 pointer + 1 refbit

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 27

Page 28: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

1.74% 1.9%

12.79% 14.28%

21.54%

47.89%

0%

10%

20%

30%

40%

50%

60%

Hash Table Microbenchmark

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 28

Lookups all hit

6 local threads reading hash tables of ~1GB

Base Chaining w/ Bkt Lock

Optimistic Cuckoo

Base Chaining w/ Bkt Lock

Optimistic Cuckoo

Lookups all miss

Server: Low Power Xeon CPU w/ 12 cores, 12 MB L3 cache

Mill

ion

Look

ups/

sec

235%

68%

Page 29: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

1.74% 1.9%

12.79% 14.28%

21.54%

47.89%

0%

10%

20%

30%

40%

50%

60%

Hash Table Microbenchmark

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 29

Lookups all hit

6 local threads reading hash tables of ~1GB

Base Chaining w/ Bkt Lock

Optimistic Cuckoo

Base Chaining w/ Bkt Lock

Optimistic Cuckoo

Lookups all miss

Server: Low Power Xeon CPU w/ 12 cores, 12 MB L3 cache

Mill

ion

Look

ups/

sec

235%

68%

Page 30: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

0"

1"

2"

3"

4"

5"

1" 2" 4" 6" 8" 10" 12" 14" 16"

MQPS%

Number%of%Server%Threads%%

End-to-end Performance

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 30

50 remote clients generate workloads

MemC3

Sharding

Memcached

16-Byte key, 32-Byte Value, 95% GET, 5% SET, zipf distributed

max tput 4.3 MOPS

max tput 1.5 MOPS

max tput 0.6 MOPS

Page 31: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Conclusion •  Optimistic cuckoo hashing

•  High space efficiency •  Optimized for read-heavy workloads •  Source Code available:

github.com/efficient/libcuckoo

•  MemC3 improves Memcached •  3x throughput, 30% more (small) objects •  Optimistic Cuckoo Hashing, CLOCK, other system

tuning

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 31

Page 32: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

References [Atikoglu12] Workload analysis of a large-scale key- value store. [Berezecki11] Many-core key-value store [Nishtala13] Scaling Memcache at Facebook

[Pagh04] Cuckoo hashing

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 32

Page 33: MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing · Overview • Previous Work: Sharding • Avoid inter-thread synchronization – e.g., dedicated cores [Berezecki11] •

Q & A •  Thanks!

Bin Fan @ NSDI 2013!http://www.pdl.cmu.edu/ 33