Fusion-io Confidential—Copyright © 2013 Fusion-io, Inc. All rights reserved. Cassandra With No Moving Parts Matt Kennedy Cassandra Summit: June 12, 2013

C* Summit 2013: No moving parts. Taking advantage of Pure Speed by Matt Kennedy


DESCRIPTION

Flash memory technology, deployed as server-side PCIe cards or solid-state disks (SSDs), is emerging as a critical tool for performance and efficiency in data centers of all scales. This presentation will discuss how the use of flash affects Cassandra deployments in terms of configuration, DRAM requirements, and performance expectations. Ideas on leveraging C*'s cutting-edge data-center awareness to blend flash and disk storage nodes for cost and workload efficiency will also be shared. Flash media itself will be examined from a physical perspective to understand endurance issues. Data on write amplification under bulk-load and operational workload conditions will be presented to explain the impact on flash of C*'s log-structured merge tree architecture and the associated compactions. Finally, we will examine strategies to make Cassandra more flash-aware using both conventional techniques and emerging non-volatile memory (NVM) programming capabilities. Lessons learned from real-world customer deployments will be shared to complete this presentation.


"Switch your database to flash now. Or you’re doing it wrong." Brian Bulkowski, Aerospike Founder and CTO

June 18, 2013 #Cassandra13

http://highscalability.com/blog/2012/12/10/switch-your-databases-to-flash-storage-now-or-youre-doing-it.html


Why?

Flash IOPS Drives Server Adoption

Disk vs. flash:

▸ Capacity: 4TB vs. 3TB
▸ IOPS: 150 vs. 200,000
▸ Cost per IOP: $$$$ vs. ¢¢¢¢


What is flash?

NAND Flash Memory

Flash is a persistent memory technology invented by Dr. Fujio Masuoka at Toshiba in 1980.

Figure: flash memory cell diagram. Labels: bit line, source line, word line, control gate, float gate, N-P-N regions.

Consumer Volume Drives Economics

Flash in Servers

Direct Cut Through Architecture

Goal of every I/O operation: to move data to/from DRAM and flash.

Figure: side-by-side data paths, legacy approach vs. Fusion direct approach. Labels: App, OS, host CPU, DRAM, PCIe, RAID controller, SAS, data path controller, super capacitors (SC), NAND.


How can we use it in Cassandra?

Cassandra I/O - Writes

http://www.datastax.com/docs/1.2/dml/about_writes

Cassandra I/O - Reads

http://www.datastax.com/docs/1.2/dml/about_reads


DRAM Dictates Cassandra Scaling

▸ Key Design Principle: Working Set < DRAM

Cost of DRAM Modules

Figure: bar chart of DRAM module cost in dollars (y-axis 0 to 1,600) by module size (4GB, 8GB, 16GB, 32GB), with the bars marked $, $$, $$$, and $$$$$$.

When do we scale out?

▸ A typical server…
  CPU Cores: 32 with HT
  Memory: 128 GB

…is your working set > 128GB?

Is there a better way?

▸ With NoSQL databases, we tend to scale out for DRAM

Combined Resources
  CPU Cores: 96
  Memory: 384 GB

More cores than needed to serve reads and writes.
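The scale-out arithmetic on the two slides above can be sketched in a few lines. This is only an illustration: the per-node figures (32 cores, 128 GB) come from the "typical server" slide, while the function name and the 300 GB working set are hypothetical inputs chosen for the example.

```python
import math

def nodes_for_working_set(working_set_gb, dram_per_node_gb=128, cores_per_node=32):
    """Return (nodes, total_cores, total_dram_gb) needed to keep the working set in DRAM."""
    nodes = max(1, math.ceil(working_set_gb / dram_per_node_gb))
    return nodes, nodes * cores_per_node, nodes * dram_per_node_gb

# Hypothetical 300 GB working set on the "typical server" from the slide.
nodes, cores, dram = nodes_for_working_set(300)
print(f"{nodes} nodes, {cores} cores, {dram} GB DRAM")
```

With these inputs the working set forces three nodes, which is where combined totals of 96 cores and 384 GB come from: the cluster grows for memory, and the extra cores come along whether or not the workload needs them.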

Flash Offers A New Architectural Choice

Figure: access-latency scale running from milliseconds (10^-3 s) through microseconds (10^-6 s) to nanoseconds (10^-9 s), positioning disk drives, server-based flash, DRAM, and CPU cache along it.

Three Deployment Options

1. All Flash
2. Data Placement (CASSANDRA-2749)
3. Use Logical Data Centers

Cassandra with All-Flash Storage

Step 1: Mount ioMemory at /var/lib/cassandra/data
Step 2:

Data Placement

▸ https://issues.apache.org/jira/browse/CASSANDRA-2749
  • Thanks Marcus!

▸ Takes advantage of filesystem hierarchy

▸ Use mount points to pin Keyspaces or Column Families to flash:
  • /var/lib/cassandra/data/{Keyspace}/{CF}

▸ Use flash for high performance needs, disk for capacity needs
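One way to verify a mount-point layout like the one described above is to check which mount point actually serves a given column family directory. The sketch below is an assumption-laden illustration, not part of Cassandra: the helper names and the keyspace and column family names are hypothetical, and only the directory pattern comes from the slide.

```python
import os

DATA_ROOT = "/var/lib/cassandra/data"  # data directory named on the slides

def column_family_dir(keyspace, column_family, data_root=DATA_ROOT):
    """Build the per-column-family directory: {data_root}/{Keyspace}/{CF}."""
    return os.path.join(data_root, keyspace, column_family)

def mount_point_of(path):
    """Walk up from path until a mount point is found (ends at the filesystem root)."""
    path = os.path.abspath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path

# Hypothetical keyspace and column family names, for illustration only.
cf_dir = column_family_dir("Keyspace1", "Users")
print(cf_dir, "is served from mount point", mount_point_of(cf_dir))
```

If a flash device is mounted at the column family directory, the reported mount point is that directory itself; if the report is a parent directory or `/`, the data is living on whatever device backs that parent.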

Data Centers for Storage Control

Figure: one Cassandra cluster split into three logical data centers, DC1 (interactive requests), DC2 (Hadoop MR jobs), and DC3 (high density replicas), each rated for performance and capacity per node. Rating labels on the figure: HIGH, MEDIUM, LOW, HIGH.
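Logical data centers are typically wired up through per-data-center replication settings on the keyspace. The sketch below renders a CQL `CREATE KEYSPACE` statement using `NetworkTopologyStrategy`. The helper name, the keyspace name, and the replica counts are assumptions made for illustration, and the data center names must match whatever the cluster's snitch reports.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Render a CQL CREATE KEYSPACE statement with per-data-center replica counts."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += ["'{}': {}".format(dc, rf) for dc, rf in replicas_per_dc.items()]
    return "CREATE KEYSPACE " + keyspace + " WITH replication = {" + ", ".join(options) + "};"

# Hypothetical replica counts: more copies on the flash-backed interactive DC.
statement = create_keyspace_cql("app", {"DC1": 2, "DC2": 1, "DC3": 1})
print(statement)
```

The point of the design is that each data center can use different storage: nodes in the interactive data center can sit on flash while the high-density replicas stay on disk, and Cassandra keeps them in sync through normal replication.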


The Numbers

YCSB Testing Setup

Figure: YCSB load generator driving the test cluster. Diagram labels: x4, x4, 10GB, 16-cores, 24GB DRAM.

Workloads use uniform random key selection instead of Zipfian.

150 million 1KB records, RF=3: ~ 120GB SSTables/node
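The per-node data size can be cross-checked with simple arithmetic. The sketch below assumes that the "x4" label means four Cassandra nodes and that the 24GB DRAM figure is per node; neither is stated outright on the slide. It also treats 1KB as 1,000 bytes.

```python
RECORDS = 150_000_000    # from the slide
RECORD_BYTES = 1_000     # "1KB", taken here as 1,000 bytes
REPLICATION_FACTOR = 3   # RF=3 from the slide
NODES = 4                # assumption: reads the "x4" label as four Cassandra nodes
DRAM_PER_NODE_GB = 24    # assumption: the 24GB DRAM label is per node

raw_gb = RECORDS * RECORD_BYTES / 1e9              # one copy of the data set
per_node_gb = raw_gb * REPLICATION_FACTOR / NODES  # replicated share held by each node
ratio = per_node_gb / DRAM_PER_NODE_GB             # on-disk data relative to DRAM

print(f"raw: {raw_gb:.0f} GB, per node: {per_node_gb:.1f} GB, data/DRAM: {ratio:.1f}x")
```

Under those assumptions each node holds a little over 110 GB of record payload before any storage overhead, which is in line with the roughly 120GB of SSTables per node quoted above, and several times more than the DRAM available. Combined with uniform key selection, that keeps the test from being served out of memory.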

YCSB: Bulk Load (CL=ALL)

Figure: YCSB inserts/sec (y-axis 0 to 70,000) over elapsed time (x-axis ticks 10 to 2,830).

Avg Latency: 0.9 ms, 95th Percentile: 1 ms, 99th Percentile: 4 ms

95/5 R/W Uniform distribution

Figure: mixed ops/sec (y-axis 0 to 80,000) over elapsed time (x-axis ticks 10 to 690), with one series each for 75, 200, and 300 threads.

# threads   Avg Lat.      95th pctl   99th pctl
75          1.4/0.22 ms   2/0 ms      5/0 ms
200         3.1/0.19 ms   7/0 ms      13/0 ms
300         4.4/2.2 ms    11/0 ms     19/0 ms
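The thread counts and average latencies in the table can be related to throughput with Little's Law. The sketch below assumes that each YCSB client thread keeps exactly one request in flight and that each latency pair is read/update; the table does not label the pair, so this is an interpretation. Client-side overhead is ignored, so the results are rough upper bounds.

```python
def estimated_ops_per_sec(threads, read_ms, update_ms, read_fraction=0.95):
    """Little's Law for a closed loop: throughput = concurrency / mean latency."""
    mean_latency_ms = read_fraction * read_ms + (1 - read_fraction) * update_ms
    return threads / (mean_latency_ms / 1000.0)

# (threads, assumed read latency in ms, assumed update latency in ms) from the table.
rows = [(75, 1.4, 0.22), (200, 3.1, 0.19), (300, 4.4, 2.2)]
for threads, read_ms, update_ms in rows:
    print(threads, "threads ->", round(estimated_ops_per_sec(threads, read_ms, update_ms)), "ops/sec")
```

The estimates land in the tens of thousands of operations per second and grow only modestly from 75 to 300 threads, because latency rises almost as fast as concurrency does.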

50/50 R/W Uniform distribution 10hrs

Figure: YCSB mixed ops/sec (y-axis 0 to 70,000) over elapsed time (x-axis ticks 10 to 35,291).

Update Latency: Average 511 µs, 95th Pctl 1 ms, 99th Pctl 2 ms

Read Latency: Average 7.0 ms, 95th Pctl 18 ms, 99th Pctl 42 ms

Write Amplification

Amplification Factor = Physical Bytes Written / Workload Bytes Written

Workload                                      Write Amp
Leveled Compaction Load (250MB tier-0)        0.8-1.2x
24-hour mixed workloads                       1.2-2.1x
Size-tiered w/Major Compactions (old skool)   3-15x

Workload Type                                   Amplification Factor
Bulk Load                                       14.8
Normal Operations (80/20 update/insert split)   4.2

Cassandra: compares favorably to HBase
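The amplification factor defined above is a plain ratio, shown here as a minimal sketch. The function name and the byte counts in the example are hypothetical; in practice the physical figure would come from device-level write counters and the workload figure from the client or benchmark.

```python
def write_amplification(physical_bytes_written, workload_bytes_written):
    """Amplification factor = physical bytes written / workload bytes written."""
    if workload_bytes_written <= 0:
        raise ValueError("workload_bytes_written must be positive")
    return physical_bytes_written / workload_bytes_written

# Hypothetical counters: the device absorbed 210 GB while clients wrote 100 GB.
factor = write_amplification(210e9, 100e9)
print(f"write amplification: {factor:.1f}x")
```

A factor below 1, as at the low end of the leveled compaction range, means fewer bytes reached the device than the workload wrote, which can happen when data is compressed before it is stored.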

Next Step in Flash Evolution

FLASH AS MEMORY
NATIVE FLASH APIs
FLASH AS DISK

Rethinking Cassandra I/O

http://www.datastax.com/docs/1.2/dml/about_writes

Figure: write-path diagram from the DataStax page above, shown on two consecutive slides, labeled first "Flash" and then "Flashtable".

Accelerating Cassandra With Flash

NAND Flash Accelerator

Real-World Cassandra on Fusion

fusionio.com | REDEFINE WHAT’S POSSIBLE

THANK YOU

Cassandra: ioDrive2 vs 10 disk RAID-0

12-hour mixed read/write workload

Figure: mixed workload throughput (y-axis 0 to 40,000) over elapsed time (x-axis ticks 10 to 42,641), with three series: CL=1 Reads, CL=Q Reads, and CL=Q Writes (throttled).

50/50 R/W Uniform distribution

Figure: YCSB mixed ops/sec (y-axis 0 to 120,000) over elapsed time (x-axis ticks 10 to 550).

Update Latency: Average 311 µs, 95th Pctl 0 ms, 99th Pctl 1 ms

Read Latency: Average 8.2 ms, 95th Pctl 20 ms, 99th Pctl 62 ms