Shunsuke Nakamura / @sunsuk7tp Tokyo Institute of Technology Master Course Tokyo, Japan

MyCassandra (Full English Version)


DESCRIPTION

This is the MyCassandra document (English version).



Page 2: MyCassandra (Full English Version)

[Figure: two panels, "Update latency in write-heavy workload" and "Read latency in read-heavy workload", each comparing read-optimized and write-optimized data stores; an arrow marks the "Better" direction.]

Page 3: MyCassandra (Full English Version)

The storage engine determines which workload a data store handles efficiently.

The distribution architecture of a data store is independent of its read and write performance characteristics.

For example, if the storage engine is exchanged for MySQL, how do the read and write characteristics change?

Data store         Performance       Storage engine   Distribution
Apache HBase       write-optimized   Bigtable-like    centralized
Apache Cassandra   write-optimized   Bigtable-like    decentralized
Sharded MySQL      read-optimized    MySQL            centralized
Yahoo! Sherpa      read-optimized    MySQL            centralized

Page 4: MyCassandra (Full English Version)

What is MyCassandra?

Page 5: MyCassandra (Full English Version)

= Dynamo + Bigtable

Page 6: MyCassandra (Full English Version)

= Dynamo (distribution: P2P/decentralized) + Bigtable (storage engine)

Page 7: MyCassandra (Full English Version)

= Dynamo (distribution: P2P/decentralized) + (storage engine)

Page 8: MyCassandra (Full English Version)

= Dynamo + storage engine (MySQL, Bigtable, or Redis)

Page 9: MyCassandra (Full English Version)

MyCassandra is a modular distributed data store. You can select a storage engine per keyspace, considering:
•  Index algorithm
•  Read-optimized vs. write-optimized
•  Sequential or random access
•  Volatile or persistent
•  Your experience with the storage engine

Page 10: MyCassandra (Full English Version)

•  MySQL (B+-tree): read-optimized
•  Bigtable (LSM-tree): write-optimized; Cassandra's original engine
•  Redis (hash): in-memory, with asynchronous snapshots
•  MongoDB (B-tree): schema-less, document-oriented database
•  Kyoto Cabinet (hash/B+-tree): simple pluggable DBM (an extension of Tokyo Cabinet)
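Per-keyspace engine selection can be pictured as a simple lookup table. The following is a minimal sketch, not MyCassandra code: the class name `EngineRegistry`, its methods, and the choice of Bigtable as the fallback engine are all assumptions made for illustration.

```python
# Hypothetical sketch: these names do not come from the MyCassandra code base.
ENGINE_TRAITS = {
    "MySQL": "read-optimized",
    "Bigtable": "write-optimized",
    "Redis": "in-memory",
}


class EngineRegistry:
    """Maps each keyspace to the storage engine chosen for it."""

    def __init__(self, default_engine="Bigtable"):
        # Assumption: unassigned keyspaces fall back to Cassandra's original engine.
        self._default = default_engine
        self._by_keyspace = {}

    def assign(self, keyspace, engine):
        if engine not in ENGINE_TRAITS:
            raise ValueError(f"unknown engine: {engine}")
        self._by_keyspace[keyspace] = engine

    def engine_for(self, keyspace):
        return self._by_keyspace.get(keyspace, self._default)
```

For example, a read-heavy keyspace could be assigned to "MySQL" while every other keyspace keeps the default.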

Page 11: MyCassandra (Full English Version)

You can adapt any data store to MyCassandra, making it a scalable data store.
•  RDB (MySQL/PostgreSQL)

You can apply it to applications whose I/O characteristics change from phase to phase.
•  MapReduce: Map - Shuffle - Reduce
•  Full-text search: crawl - indexing - search

You can apply it to any IaaS environment.
•  EC2 + RDS (MyCassandra with MySQL)

Page 12: MyCassandra (Full English Version)

[Figure: "Max. QPS for 40 Clients" (qps, axis from 0 to 40000) for Bigtable, MySQL, and Redis under Write Only, Write Heavy, Read Heavy, and Read Only workloads; an arrow marks the "Better" direction.]

Page 13: MyCassandra (Full English Version)

select

Page 14: MyCassandra (Full English Version)

client
•  o.a.c.cli
•  o.a.c.avro/thrift

proxy
•  o.a.c.service.StorageProxy

server
•  o.a.c.service.StorageService
•  o.a.c.db.ReadVerbHandler/RowMutationVerbHandler

engine
•  o.a.c.db.Table (by a keyspace)
   •  o.a.c.db.commitlog
   •  o.a.c.db.ColumnFamilyStore (by a columnfamily)
   •  o.a.c.db.engine.StorageEngineInterface ← added
   •  o.a.c.db.engine.MySQLInstance, RedisInstance, MongoDBInstance, …

Page 15: MyCassandra (Full English Version)

Now supporting
•  put (key, cf): Insert/Update/Delete
•  get (key)
•  getRangeSlice (startWith, endWith, maxResults)
•  truncate/dropTable/dropDB

Next supporting
•  secondaryIndex
•  expire
•  counter (Cassandra-0.8 ~)

At a minimum, you implement these two methods.
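To make the shape of a pluggable engine concrete, here is a minimal Python sketch. It is not the real `o.a.c.db.engine.StorageEngineInterface` (which is Java and whose signatures are not shown in these slides). It assumes that the two required methods are put and get, and that passing `None` as the value means delete; both are assumptions.

```python
from abc import ABC, abstractmethod


class StorageEngine(ABC):
    """Hypothetical stand-in for a pluggable engine; not the real Java interface."""

    @abstractmethod
    def put(self, key, cf):
        """Insert or update the row; cf=None is treated here as a delete."""

    @abstractmethod
    def get(self, key):
        """Return the stored value, or None if the key is absent."""


class DictEngine(StorageEngine):
    """Toy in-memory engine backed by a dict."""

    def __init__(self):
        self._rows = {}

    def put(self, key, cf):
        if cf is None:
            self._rows.pop(key, None)  # delete
        else:
            self._rows[key] = cf       # insert or update

    def get(self, key):
        return self._rows.get(key)

    def get_range_slice(self, start_with, end_with, max_results):
        # Optional extra: keys in [start_with, end_with], in sorted order.
        keys = sorted(k for k in self._rows if start_with <= k <= end_with)
        return [(k, self._rows[k]) for k in keys[:max_results]]
```

The range method is included only to show how the optional operations could sit on top of the two required ones.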

Page 16: MyCassandra (Full English Version)

The data model is the same as Cassandra's.
•  But super columns are not supported yet.

Data is stored in the same key/value format as SSTable.
•  This supports NoSQL stores of any data model.

For a NoSQL store whose data model has fewer dimensions than Cassandra's:
•  Add a prefix to the primary key.
•  The prefix is the keyspace/column family name.
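The key-prefix idea can be sketched in a few lines. This is an illustration only: the helper names and the ":" separator are assumptions, and a real implementation would need to escape or forbid the separator inside column family names.

```python
SEP = ":"  # assumed separator between the prefix and the row key


def add_prefix(column_family, row_key):
    """Build the key stored in a flat key-value store."""
    return f"{column_family}{SEP}{row_key}"


def strip_prefix(stored_key):
    """Recover (column_family, row_key); splits on the first separator only."""
    column_family, row_key = stored_key.split(SEP, 1)
    return column_family, row_key
```

With this scheme, rows from different column families can share one flat key space without colliding.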

Page 17: MyCassandra (Full English Version)

Cassandra       MySQL      Redis
keyspace        database   db
column family   table      record
column          field

Page 18: MyCassandra (Full English Version)

Bigtable (Cassandra): a keyspace holding column families A and B, with one column per attribute.

key      visits   plan
sato     18       Gold
suzuki   214      Bronze

key      gender   age   region
sato     male     17    [null]
suzuki   female   21    Tokyo

RDB (MySQL): a database holding tables A and B, each with a key and a serialized values field.

key      values
sato     gender;male;age;17
suzuki   gender;female;age;21;region;Tokyo

key      values
sato     visits;18;plan;Gold
suzuki   visits;214;plan;Bronze

KVS (Redis): a single db; each key carries a prefix.

key        values
A:sato     …
B:ito      …
A:suzuki   …
B:tanaka   …
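The "values" strings in the RDB example follow a name;value;name;value pattern with null columns left out. A minimal sketch of that flattening is below. The function names are hypothetical, and the sketch assumes names and values never contain ";" (a real serializer would need escaping or a binary format).

```python
def serialize_columns(columns):
    """Flatten {name: value} into 'name;value;name;value', skipping nulls."""
    parts = []
    for name, value in columns.items():
        if value is None:
            continue  # null columns are simply omitted
        parts.extend([name, str(value)])
    return ";".join(parts)


def deserialize_columns(text):
    """Inverse of serialize_columns; all values come back as strings."""
    fields = text.split(";") if text else []
    return dict(zip(fields[0::2], fields[1::2]))
```

For instance, a row with gender "male", age 17, and a null region serializes to the same string as the "sato" row in the example.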

Page 19: MyCassandra (Full English Version)

Now: a key and a value, where the value is a serialized object.
↓ (easy to change)
A column is mapped to a MySQL field.
•  This reduces overhead, but a schema is needed.

Add specialized columns.
•  For secondary search
•  For range queries

[Figure: table layout with columns rowKey, CF, counter, secondary index, and token, annotated as primary key (key), serialized object (value), and specialized columns for secondary search and range search.]

Page 20: MyCassandra (Full English Version)

A heterogeneous cluster
•  It combines multiple types of nodes that run different storage engines.
•  The replicas of each record are placed on different storage engines.
•  A proxy routes each query to the nodes that process it efficiently.

[Figure: write query and read query paths across a W node (Bigtable) and an R node (MySQL); in each case one replica is accessed synchronously and the other asynchronously.]
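The routing idea can be sketched as a function that splits the replicas of a record into those contacted synchronously and those contacted asynchronously. This is an illustration under assumptions, not the StorageProxy logic: the function name, the role table, and the rule "writes wait on W nodes, reads wait on R nodes" are simplifications made for this example.

```python
# Assumed role of each engine: W = write-optimized, R = read-optimized.
NODE_ROLE = {"Bigtable": "W", "MySQL": "R"}

# Assumed rule: each query type waits only on the role that serves it well.
PREFERRED_ROLE = {"write": "W", "read": "R"}


def split_replicas(query_type, replicas):
    """Return (sync, async) replica lists for a 'read' or 'write' query."""
    preferred = PREFERRED_ROLE[query_type]
    sync = [r for r in replicas if NODE_ROLE[r] == preferred]
    rest = [r for r in replicas if NODE_ROLE[r] != preferred]
    return sync, rest
```

With replicas on one Bigtable node and one MySQL node, a write would wait on Bigtable and a read would wait on MySQL.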

Page 21: MyCassandra (Full English Version)

MyCassandra Cluster keeps the same consistency level as Cassandra.

Quorum protocol: (write agreements) + (read agreements) > (replicas)

•  This protocol guarantees that a read obtains the most recent value from at least one replica.

Our system needs one node that synchronously processes both read and write queries.

→ Memory-based node (Redis)
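The quorum condition works because any set of write agreements and any set of read agreements must then share at least one replica. The sketch below checks both the inequality and the overlap by brute force; the function names are made up for this example.

```python
from itertools import combinations


def satisfies_quorum(write_agreements, read_agreements, replicas):
    """The condition from the slide: W + R > N."""
    return write_agreements + read_agreements > replicas


def always_overlap(write_agreements, read_agreements, replicas):
    """True if every possible write set shares a node with every read set."""
    nodes = range(replicas)
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, write_agreements)
        for rs in combinations(nodes, read_agreements)
    )
```

With 3 replicas, 2 write agreements and 2 read agreements satisfy the condition and always overlap, whereas 1 and 1 do not.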

•  W: write-optimized (e.g. Bigtable)
•  R: read-optimized (e.g. MySQL)
•  RW: memory-based (e.g. Redis)

[Figure: write and read paths across W, R, and RW nodes.]

Page 22: MyCassandra (Full English Version)

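1)  A proxy broadcasts the query to the nodes responsible for the record.
2)  The proxy waits for acks; in this example it waits for two acks for the write and then returns.
3a) Write success: the proxy returns a success message to the client.
3b) Write failure: the proxy keeps waiting for acks until the required total is reached.
4)  The proxy asynchronously waits for acks from the remaining nodes.

Write latency: max (W, RW)

Example: (replicas) = 3, (write agreements) = 2, W:RW:R = 1:1:1

•  W: write-optimized (e.g. Bigtable)
•  R: read-optimized (e.g. MySQL)
•  RW: memory-based (e.g. Redis)

[Figure: the client sends a write to the proxy; the proxy waits for two acks among the nodes responsible for the record and returns, while the remaining write completes asynchronously.]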
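Why the write latency becomes max (W, RW) can be seen with a small model: the proxy returns as soon as the required number of acks has arrived, so the latency is that of the slowest node among the fastest ones. This is a sketch with a hypothetical function name, and the ack times in the usage note are made-up numbers, not measurements.

```python
def write_latency(ack_times, required_acks=2):
    """Time at which the proxy has collected `required_acks` acks.

    ack_times maps a node label to the time its ack arrives.
    """
    ordered = sorted(ack_times.values())
    if len(ordered) < required_acks:
        raise ValueError("not enough replicas to satisfy the write quorum")
    # The proxy unblocks when the required_acks-th fastest ack arrives.
    return ordered[required_acks - 1]
```

If, say, the W node acks at 1.0, the RW node at 0.5, and the R node at 8.0 (arbitrary units), the proxy returns at 1.0, which equals max (W, RW); the slow R node is only waited for asynchronously.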

Page 23: MyCassandra (Full English Version)

1)  A proxy sends a data request to an R or RW node and digest requests to the other replicas.
2)  The proxy waits for replies, including the requested record.
3a) Success: if the record and the digests are consistent, the proxy returns the record to the client.
3b) Failure or inconsistency: the proxy keeps reading and collecting digests until they satisfy the quorum.
4)  After replying to the client, the proxy waits for the remaining nodes. If there is an inconsistency, it is resolved using Read Repair.

Read latency: max (R, RW)

Example: (replicas) = 3, (read agreements) = 2, W:RW:R = 1:1:1

•  W: write-optimized (e.g. Bigtable)
•  R: read-optimized (e.g. MySQL)
•  RW: memory-based (e.g. Redis)

[Figure: the client sends a read to the proxy; the proxy checks consistency among the nodes responsible for the record and returns the result, while the remaining consistency check runs asynchronously.]
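The digest comparison in steps 1 to 3 can be sketched as follows. This is a simplified model with hypothetical names: the hash function (SHA-256) is an arbitrary choice for the example, and a `False` result only signals that the caller would have to fetch full records and run read repair, which is not modeled here.

```python
import hashlib


def digest(value):
    """Digest of a record value; SHA-256 is an arbitrary choice for this sketch."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def quorum_read(data_reply, digest_replies, read_agreements=2):
    """Return (value, consistent).

    data_reply is the full record from an R or RW node; digest_replies are
    the digests from the other replicas, in arrival order.
    """
    expected = digest(data_reply)
    agreeing = 1  # the data reply agrees with itself
    if agreeing >= read_agreements:
        return data_reply, True
    for reply in digest_replies:
        if reply == expected:
            agreeing += 1
        if agreeing >= read_agreements:
            return data_reply, True
    return data_reply, False
```

The loop stops as soon as enough digests match, so later replies do not delay the answer to the client.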

Page 24: MyCassandra (Full English Version)

[Figure: max. qps for 40 clients (query/sec, axis from 0 to 20000), Cassandra vs. MyCassandra Cluster, for Write-Only [100:0], Write-Heavy [50:50], Read-Heavy [5:95], and Read-Only [0:100] workloads ([write:read]); the bars are annotated with the ratios ×1.54, ×6.53, ×0.93, and ×0.90, and an arrow marks the "Better" direction.]

•  YCSB / Zipfian
•  Throughput was up to 6.53 times as high as that of Cassandra.
•  In the Write-Heavy workload, multiple read repairs occur.

Page 25: MyCassandra (Full English Version)

MyCassandra-0.2.2
•  secondaryIndex: applies to MySQL and MongoDB

MyCassandra-0.3.0
•  Based on Cassandra-0.8
•  Atomic counter
•  Brisk (Hadoop + Cassandra)…

Page 26: MyCassandra (Full English Version)

1.  Asynchronous deletion
2.  Engine failure detection
3.  Support for ad hoc query

Page 27: MyCassandra (Full English Version)

Cassandra's delete/expire operation
•  Logical deletion using tombstones
•  Actual deletion during SSTable compaction
→ This approach depends on the Bigtable engine.

MyCassandra (MySQL, Redis, …)
•  Synchronous deletion (now)
•  The expire function works well, but the data continues to exist.
•  Asynchronous deletion is a heavy operation.
   •  It requires I/O over one big table, unlike an SSTable, which holds only a subset of the data.
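The difference between logical and actual deletion can be shown with a toy store. This is a generic sketch of the tombstone technique with hypothetical names, not Cassandra's or MyCassandra's implementation; in particular, `compact` here scans the whole store at once, which mirrors the "one big table" cost mentioned above.

```python
TOMBSTONE = object()  # marker meaning "logically deleted"


class TombstoneStore:
    """Toy key-value store with logical deletion and a compaction step."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def delete(self, key):
        # Logical deletion: overwrite with a marker, keep the entry.
        self._rows[key] = TOMBSTONE

    def get(self, key):
        value = self._rows.get(key)
        return None if value is TOMBSTONE else value

    def physical_size(self):
        return len(self._rows)

    def compact(self):
        """Actual deletion: drop tombstoned entries; return how many were dropped."""
        before = len(self._rows)
        self._rows = {k: v for k, v in self._rows.items() if v is not TOMBSTONE}
        return before - len(self._rows)
```

After `delete`, reads no longer see the key, but the entry still occupies space until `compact` runs.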

Page 28: MyCassandra (Full English Version)

•  Storage engine failure only: how to detect the failure, and how the instance should behave
•  Several storage engines with a partial failure: how the instance should behave

[Figure: an instance detects an engine failure by periodic polling. Open questions: What should the instance do? Treat it as an overall instance failure (node down)? Should another node take over?]
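Periodic polling of an engine could look like the sketch below. Everything here is an assumption made for illustration: the class name, the `ping` callable, and the policy of declaring the engine down after three consecutive failed checks. What the instance should do once the engine is considered down is exactly the open question above and is left to the caller.

```python
class EngineMonitor:
    """Counts consecutive failed health checks of one storage engine."""

    def __init__(self, ping, max_failures=3):
        self._ping = ping                  # callable; truthy result means healthy
        self._max_failures = max_failures  # assumed threshold, not from the slides
        self._failures = 0

    def poll_once(self):
        """Run one health check; return True while the engine is considered alive."""
        try:
            ok = bool(self._ping())
        except Exception:
            ok = False  # an exception from the engine counts as a failed check
        self._failures = 0 if ok else self._failures + 1
        return self._failures < self._max_failures
```

A scheduler would call `poll_once` at a fixed interval; a single successful check resets the failure count.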

Page 29: MyCassandra (Full English Version)

Ad hoc query and data model
•  If a feature does not depend on the distributed architecture, it can be added easily.
   •  Data model of Redis (List, Set, ..)
   •  Document data model and ad hoc queries of MongoDB
•  But if it does depend on it, it cannot be supported.
   •  Atomic queries across multiple keys
   •  Join

It is important to determine whether the query depends on the distribution mechanism.

Page 30: MyCassandra (Full English Version)

GitHub
•  https://github.com/sunsuk7tp/MyCassandra/

Twitter
•  @MyCassandraJP
•  @_MyCassandra # @MyCassandra had already been taken!!
•  @sunsuk7tp # my private account

Google Groups
•  https://groups.google.com/group/my-cassandra

Page 31: MyCassandra (Full English Version)

Thank you !