Gerardo “Gerry” Narvaja Technical Servicesfiles.meetup.com/107604/TokuMX 20130724.pdf · • TokuDB - storage engine for MySQL and MariaDB • TokuMX – high performance version

TokuMX for Dolphins

Gerardo “Gerry” Narvaja

Technical Services

Who Are We?


•  Tokutek builds high-performance database software!

•  TokuDB - storage engine for MySQL and MariaDB

•  TokuMX – high performance version of MongoDB

Better Indexing Improves Performance

•  TokuDB® and TokuMX® use Fractal Tree® technology:
–  Internal nodes are similar to B-trees, but they also have message buffers
–  Large block size (4M) enables better compression performance and range queries.
–  Basement nodes support point queries
o  Default size 128K
–  Optimal I/O utilization:
o  Reads: highly compressed data
o  Writes: aggregation of multiple operations + high compression.

B-Tree

•  Rule >=

[Figure: B-tree with root pivot 22, internal pivots 10 and 99, and leaves (2,3,4), (10,20), (22,25), (99)]

B-Tree

•  Insert 15

[Figure: the same B-tree after the insert; leaves are now (2,3,4), (10,15,20), (22,25), (99)]

B-Tree

•  In RAM

[Figure: B-tree with root pivot 22, internal pivots 10 and 99, and leaves (2,3,4), (10,15,20), (22,25), (99)]

B-Tree

•  Find 25

[Figure: B-tree with root pivot 22, internal pivots 10 and 99, and leaves (2,3,4), (10,15,20), (22,25), (99)]

B-Tree

•  From disk

[Figure: B-tree with root pivot 22, internal pivots 10 and 99, and leaves (2,3,4), (10,15,20), (22,25), (99)]
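The "Find 25" walk on the B-tree slides can be sketched in a few lines of Python. This is only an illustration: the tuple-based node layout and the `find` helper are assumptions made for this sketch, and the ">=" rule is read as "keys greater than or equal to a pivot go to the child on its right".

```python
from bisect import bisect_right

# Tree from the slides after inserting 15: root pivot 22, internal pivots
# 10 and 99, leaves (2,3,4), (10,15,20), (22,25), (99).
# A node is ("leaf", keys) or ("internal", pivots, children); this layout
# is a hypothetical one chosen for the sketch.
TREE = ("internal", [22], [
    ("internal", [10], [("leaf", [2, 3, 4]), ("leaf", [10, 15, 20])]),
    ("internal", [99], [("leaf", [22, 25]), ("leaf", [99])]),
])


def find(node, key):
    """Return (found, nodes_visited) while descending from root to leaf."""
    visited = 1
    while node[0] == "internal":
        _, pivots, children = node
        # Keys >= a pivot go to the child on that pivot's right.
        node = children[bisect_right(pivots, key)]
        visited += 1
    return key in node[1], visited


print(find(TREE, 25))
```

Every lookup touches one node per level. If a node on that path is not in RAM, it has to come from disk, which is the point of the "In RAM" and "From disk" slides.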

Fractal Tree Indexes

Each node has pivots & Buffers

Buffers fill as updates arrive

Flush a buffer when it fills

A flush might take an I/O, but it does lots of useful work

More changes per write ➔ fewer writes for the same load ➔ less SSD wear

Fractal Tree Indexes: Queries

Lots of buffers have messages

But query follows root-leaf path

So every query has the most up-to-date information

Messages can be insert, update, delete
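A toy model may help picture the buffering described on these slides. It is a minimal sketch, not the Fractal Tree implementation: it has a single root with one message buffer per leaf, and the names `BufferedTree`, `send`, `get`, `buffer_capacity` and `leaf_writes` are all hypothetical. A real tree has buffers at every internal level.

```python
from bisect import bisect_right


class BufferedTree:
    """Toy two-level tree: a root with one message buffer per leaf."""

    def __init__(self, pivots, buffer_capacity=4):
        self.pivots = pivots
        self.leaves = [dict() for _ in range(len(pivots) + 1)]
        self.buffers = [[] for _ in range(len(pivots) + 1)]
        self.buffer_capacity = buffer_capacity
        self.leaf_writes = 0  # simulated leaf I/Os

    def _child(self, key):
        return bisect_right(self.pivots, key)

    @staticmethod
    def _apply(store, op, key, value):
        if op == "delete":
            store.pop(key, None)
        else:  # insert and update both set the value in this toy model
            store[key] = value

    def send(self, op, key, value=None):
        """Queue an insert/update/delete message; flush a buffer when full."""
        i = self._child(key)
        self.buffers[i].append((op, key, value))
        if len(self.buffers[i]) >= self.buffer_capacity:
            self._flush(i)

    def _flush(self, i):
        for op, key, value in self.buffers[i]:
            self._apply(self.leaves[i], op, key, value)
        self.buffers[i].clear()
        self.leaf_writes += 1  # one write carries many changes

    def get(self, key):
        """Follow the root-to-leaf path and apply pending messages."""
        i = self._child(key)
        result = {key: self.leaves[i][key]} if key in self.leaves[i] else {}
        for op, k, value in self.buffers[i]:
            if k == key:
                self._apply(result, op, k, value)
        return result.get(key)


tree = BufferedTree(pivots=[22])
tree.send("insert", 2, "a")
tree.send("update", 2, "c")
print(tree.get(2), tree.leaf_writes)
```

The query sees the buffered update even though no leaf has been written yet, and a later flush applies the whole buffer in one simulated write.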

Gimme, gimme, gimme …

Storage:

MongoDB and TokuMX


MongoDB Storage - Overview

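[Figure: two B-tree indexes whose leaf entries point into the memory mapped heap]

•  PK index (_id + pointer): root 18; internal pivots 4 and 5555; leaf entries (1,ptr5), (4,ptr1), (12,ptr8), (19,ptr7), (10000,ptr2)
•  Secondary index (foo + pointer): root 85; internal pivots 40 and 120; leaf entries (2,ptr5), (22,ptr6), (50,ptr4), (100,ptr7), (222,ptr3)
•  The “pointer” tells MongoDB where to look in the heap for the document.
•  Example: db.test.insert({foo:55}), db.test.ensureIndex({foo:1})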

MongoDB Storage - Maintenance

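[Figure: MongoDB indexes over the memory mapped heap; shading marks what is in memory]

•  PK index (_id + pointer): root 18; internal pivots 4 and 5555; leaf entries (1,ptr5), (4,ptr1), (12,ptr8), (19,ptr7), (10000,ptr2)
•  Secondary index (foo + pointer): root 85; internal pivots 40 and 120; leaf entries (2,ptr5), (22,ptr6), (50,ptr4), (100,ptr7), (222,ptr3)
•  Shaded represents what is in memory.
•  db.test.insert({foo:1}) requires IO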

TokuMX = MongoDB + Fractal Tree Indexes

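[Figure: two Fractal Tree indexes; the PK index stores the document itself]

•  PK index (_id + document): root 18; internal pivots 4 and 5555; leaf entries (1,doc), (4,doc), (12,doc), (19,doc), (10000,doc)
•  Secondary index (foo + _id): root 85; internal pivots 40 and 120; leaf entries (2,4), (22,12), (50,19), (100,10000), (222,1)
•  Example: db.test.insert({foo:55}), db.test.ensureIndex({foo:1})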

memory mapped heap
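The two storage layouts can be compared with plain dictionaries. This is a sketch under assumptions: the (foo, _id) pairs are taken from the TokuMX slide, the pointer names are made up for illustration, and dictionaries stand in for the on-disk trees.

```python
# (foo, _id) pairs as shown in the TokuMX secondary index.
FOO_TO_ID = {2: 4, 22: 12, 50: 19, 100: 10000, 222: 1}
DOCS = {_id: {"_id": _id, "foo": foo} for foo, _id in FOO_TO_ID.items()}

# MongoDB-style layout: both indexes hold pointers, documents live in a heap.
# The pointer values below are hypothetical.
heap = {f"ptr{n}": doc for n, doc in enumerate(DOCS.values(), start=1)}
ptr_of_id = {doc["_id"]: ptr for ptr, doc in heap.items()}
mongo_pk = dict(ptr_of_id)                                   # _id -> pointer
mongo_secondary = {foo: ptr_of_id[_id] for foo, _id in FOO_TO_ID.items()}

# TokuMX-style layout: the PK index holds the document, the secondary
# index holds the _id.
toku_pk = dict(DOCS)                                         # _id -> document
toku_secondary = dict(FOO_TO_ID)                             # foo -> _id


def mongo_find_by_id(_id):
    return heap[mongo_pk[_id]]           # index lookup, then heap lookup


def toku_find_by_id(_id):
    return toku_pk[_id]                  # the document is in the index


def mongo_find_by_foo(foo):
    return heap[mongo_secondary[foo]]    # index lookup, then heap lookup


def toku_find_by_foo(foo):
    return toku_pk[toku_secondary[foo]]  # secondary -> _id -> PK index


print(toku_find_by_foo(100))
```

Both layouts return the same document. What differs is where the document lives and how many structures a lookup has to touch.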


TokuMX Storage - Maintenance

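[Figure: TokuMX indexes with insert messages injected; shading marks what is in memory]

•  PK index (_id + document): root 18; internal pivots 4 and 5555; leaf entries (1,doc), (4,doc), (12,doc), (19,doc), (10000,doc)
•  Secondary index (foo + _id): root 85; internal pivots 40 and 120; leaf entries (2,4), (22,12), (50,19), (100,10000), (222,1)
•  Shaded represents what is in memory.
•  db.test.insert({foo:1}) does not require IO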

Performance - Index Maintenance

•  Fractal Tree Indexes are far superior to B-trees for maintaining indexes larger than RAM
–  Message buffers delay IO and cache disruption
–  Not just inserts … updates and deletes too

Performance - Inserts

•  100 million inserts into a collection with 3 secondary indexes

Performance - Inserts on indexed arrays

•  Indexed insertion: multikey (100 inserts per doc)

Performance - Replication

•  TokuMX replication allows secondary servers to process replication without IO
–  Simply injecting messages into the Fractal Tree Indexes on the secondary server
–  The “Hard Work” was done on the primary
   •  Uniqueness checking
   •  Transactional locking
–  Elimination of replication lag.
   •  Benchmarks to come
•  Your secondaries are fully available for read scaling!

Performance - Lock Refinement

•  MongoDB originally implemented a global write lock
–  1 writer at a time
•  MongoDB v2.2 moved this lock to the database level
–  1 writer at a time in each database
•  TokuMX performs locking at the document level

•  Sysbench benchmark (> RAM)
–  lock refinement introduced in v0.1.0

•  Sysbench loading (in-memory)
–  lock refinement introduced in v0.1.0

Performance - Clustered Indexes

•  In TokuMX, the primary key (_id) is clustered
–  Ordered by _id, co-located with the document
•  Lookups by _id require no additional IO to retrieve the document
–  MongoDB must retrieve via memory mapped heap
•  Secondary indexes can optionally be created as “clustering”
–  Ordered by secondary index field(s)
–  Additional copy of the document is co-located
–  Lookups using this index also require no additional IO to retrieve the document
–  Good for point lookups, even better for range scans
–  Compression and efficient index maintenance reduce the cons
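Why a clustering secondary index helps range scans can be sketched by counting the extra primary-key lookups a scan needs. The sample documents reuse the (foo, _id) pairs from the storage slides. The sorted lists and the `range_scan` helper are assumptions made for this illustration, not TokuMX internals.

```python
from bisect import bisect_left, bisect_right

DOCS = {
    4: {"_id": 4, "foo": 2},
    12: {"_id": 12, "foo": 22},
    19: {"_id": 19, "foo": 50},
    10000: {"_id": 10000, "foo": 100},
    1: {"_id": 1, "foo": 222},
}

# Non-clustering index: (foo, _id), ordered by foo.
plain_index = sorted((d["foo"], d["_id"]) for d in DOCS.values())
# Clustering index: (foo, copy of the document), ordered by foo.
clustering_index = sorted(
    ((d["foo"], dict(d)) for d in DOCS.values()), key=lambda entry: entry[0]
)
foo_keys = [foo for foo, _ in plain_index]


def range_scan(lo, hi, clustering):
    """Return (documents with lo <= foo <= hi, number of extra PK lookups)."""
    start, end = bisect_left(foo_keys, lo), bisect_right(foo_keys, hi)
    results, pk_lookups = [], 0
    for i in range(start, end):
        if clustering:
            results.append(clustering_index[i][1])    # document is right here
        else:
            results.append(DOCS[plain_index[i][1]])   # extra lookup by _id
            pk_lookups += 1
    return results, pk_lookups


print(range_scan(20, 100, clustering=False)[1])
print(range_scan(20, 100, clustering=True)[1])
```

In this sketch the non-clustering scan pays one extra lookup per matching document, while the clustering scan reads the documents in index order with none.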


Performance - Large Block Size

•  Data is stored in 64K chunks (basement nodes)
•  4MB of these chunks are compressed, grouped and written as a block
–  * both of these values are user definable
•  As a result, range scans perform sequential IO rather than random IO
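As a quick worked example of the sizes on this slide, the number of basement nodes grouped into one block follows directly from the two values. The sketch assumes the slide's "64K" and "4MB" are binary units and refer to uncompressed sizes.

```python
KIB = 1024
MIB = 1024 * KIB

basement_node_size = 64 * KIB  # "64K chunks" from the slide
block_size = 4 * MIB           # "4MB" from the slide

# Number of basement nodes written together as one block.
basement_nodes_per_block = block_size // basement_node_size
print(basement_nodes_per_block)  # 64
```

With these values a range scan that reads one block gets 64 adjacent basement nodes in a single sequential read.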


Performance - Memory Management

•  Two approaches to memory management
–  MongoDB = memory-mapped files
   •  Operating system determines what data is important
–  TokuMX = managed cache
   •  User defined size
   •  TokuMX determines what data is important
•  Run multiple TokuMX instances on a single server
–  Each has its own fixed cache size

Performance - Reduced IO

•  Message based architecture of Fractal Tree Indexes allows several operations per IO
–  Applied when buffer is flushed to leaf nodes
–  MongoDB is 1-to-1
•  Reads and writes are highly compressed
–  Big/infrequent writes are flash friendly

Performance - Reduced IO

–  Indexed insertion benchmark

Performance - Shard Migration

–  In sharded collections, range queries in TokuMX are optimized thanks to the use of a clustering index for the shard key

–  Shard migration between TokuMX servers imposes very low I/O overhead

–  This makes low-entropy keys good candidates for sharding

Compression

•  MongoDB does not offer compression
–  Workarounds: compressed file systems and shortened field names
•  TokuMX easily achieves 5x-10x compression
–  Buy less disk or flash
–  Compressed reads and writes reduce overall IO
•  TokuMX supports 3 compression types
–  zlib, quicklz, lzma (size vs. speed)
–  all data is compressed
•  Use descriptive field names!
–  They are easy to compress
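The field-name advice can be checked with Python's standard `zlib` module. This is a rough sketch under assumptions: JSON stands in for BSON, the whole list is compressed at once rather than in TokuMX-style blocks, and the field names and `encoded_sizes` helper are made up for the example.

```python
import json
import zlib


def encoded_sizes(field_name, count=1000):
    """Return (raw_bytes, zlib_bytes) for `count` one-field documents."""
    docs = [{field_name: i} for i in range(count)]
    raw = json.dumps(docs).encode("utf-8")
    return len(raw), len(zlib.compress(raw))


short_raw, short_zip = encoded_sizes("ca")
long_raw, long_zip = encoded_sizes("customer_account_number")

print(f"short name: {short_raw} -> {short_zip} bytes")
print(f"long name:  {long_raw} -> {long_zip} bytes")
```

The long field name makes the raw data much larger, but the repeated name compresses very well, so the long-name version ends up with the better compression ratio.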


Compression

•  Chart shows space used for 51 million mostly random documents

•  46GB vs. ~15GB


ACID + MVCC

•  ACID
–  In MongoDB, multi-insertion operations allow for partial success
   •  Asked to store 5 documents, 3 succeeded
–  We offer “all or nothing” behavior
–  Document level locking
•  MVCC
–  In MongoDB, queries can be interrupted by writers.
   •  The effects of these writers are visible to the reader
–  TokuMX offers MVCC
   •  Reads are consistent as of the operation start
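The difference between partial success and “all or nothing” can be simulated with a dictionary standing in for a collection. This is only a model of the behavior described above, not driver or server code. The function names and the duplicate `_id` used to trigger the failure are assumptions made for the sketch.

```python
class DuplicateKey(Exception):
    """Raised when a batch contains an _id that is already stored."""


def insert_partial(collection, docs):
    """Stop at the first failure but keep the documents already stored."""
    inserted = 0
    for doc in docs:
        if doc["_id"] in collection:
            return inserted
        collection[doc["_id"]] = doc
        inserted += 1
    return inserted


def insert_all_or_nothing(collection, docs):
    """Apply the whole batch, or leave the collection untouched."""
    staged = dict(collection)  # work on a copy until every insert succeeds
    for doc in docs:
        if doc["_id"] in staged:
            raise DuplicateKey(doc["_id"])
        staged[doc["_id"]] = doc
    collection.clear()
    collection.update(staged)
    return len(docs)


batch = [{"_id": i} for i in (1, 2, 3, 3, 4)]  # the fourth document collides

store = {}
print(insert_partial(store, batch), sorted(store))
```

With the partial-success function, 3 of the 5 documents end up stored. With the all-or-nothing function, the same batch raises `DuplicateKey` and the collection stays empty.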


Multi-statement Transactions

•  TokuMX brings the following to MongoDB
–  db.runCommand({"beginTransaction": 1, "isolation": "mvcc"})
–  ... perform 1 or more operations
–  db.runCommand("rollbackTransaction") | db.runCommand("commitTransaction")
•  Zardosht has some great blogs
–  http://www.tokutek.com/2013/04/mongodb-transactions-yes/
–  http://www.tokutek.com/2013/04/mongodb-multi-statement-transactions-yes-we-can/

New v1.0.3 - Today!

•  MongoDB to TokuMX migration tool
–  mongo2toku
–  Reads and replays vanilla MongoDB replication
–  Allows TokuMX to sync from vanilla MongoDB
•  Leif has a great blog explaining the process
–  http://www.tokutek.com/2013/07/tokumx-1-0-3-seamless-migrations-from-mongodb/

Open Source Resources

•  Repository in GitHub
–  https://github.com/tokutek
•  Google Groups
–  http://groups.google.com
–  tokumx-user: community users and support
–  tokumx-dev: contributors
•  IRC
–  #tokutek

We’ll help you to find solutions …

Time for Hands On …

Contact Information

•  Web site
–  http://tokutek.com
•  IRC
–  #tokutek
•  Google Groups
–  http://groups.google.com
–  tokumx-user, tokumx-dev
•  GitHub
–  https://github.com/Tokutek
•  Twitter
–  @tokutek, @seattlegaucho
•  Email
–  [email protected]

Thank You!

We’re Hiring!

Looking for Quality Assurance and Support Ninjas!

* Boston Area
