20121024 mongodb-boston (1)


MongoDB and Fractal Tree® Indexes

Tim Callaghan*
VP/Engineering, Tokutek

[email protected]

MongoDB Boston 2012

* not [yet] a MongoDB expert


B-trees

B-tree Definition

In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.

http://en.wikipedia.org/wiki/B-tree

B-tree Overview

I will use a simple single-pivot example throughout this presentation


Basic B-tree

Internal Nodes - Path to data

Leaf Nodes - Actual Data

Pointers

Pivots

B-tree example

22
10 | 99
[2, 3, 4] [10, 20] [22, 25] [99]

* Pivot Rule is >=
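The pivot rule can be sketched in a few lines. This is a minimal illustration built on the example tree, not any particular B-tree implementation; the node layout (`pivots`, `children`, `keys`) and the helper name `findLeaf` are assumptions made for this sketch.

```javascript
// Example tree from the slide. Internal nodes hold pivots and child pointers;
// leaf nodes hold the actual keys. The object layout is hypothetical.
const leaf = (keys) => ({ keys });
const tree = {
  pivots: [22],
  children: [
    { pivots: [10], children: [leaf([2, 3, 4]), leaf([10, 20])] },
    { pivots: [99], children: [leaf([22, 25]), leaf([99])] },
  ],
};

// Descend from the root to the leaf that owns `key`.
// Pivot rule is >=: a key equal to a pivot goes to the child on the right.
function findLeaf(node, key) {
  while (node.children) {
    let i = 0;
    while (i < node.pivots.length && key >= node.pivots[i]) i++;
    node = node.children[i];
  }
  return node;
}

console.log(findLeaf(tree, 22).keys); // the leaf that starts at 22
console.log(findLeaf(tree, 9).keys);  // the left-most leaf
```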

B-tree - insert

22
10 | 99
[2, 3, 4] [10, 15, 20] [22, 25] [99]

“Insert 15”

Value stored in leaf node
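A minimal sketch of “Insert 15” on the example tree, assuming a simple in-memory node layout (the field names are made up for this sketch) and leaving out node splits:

```javascript
// Hypothetical in-memory layout: internal nodes have pivots + children,
// leaf nodes have a sorted array of keys. Node splits are left out.
const tree = {
  pivots: [22],
  children: [
    { pivots: [10], children: [{ keys: [2, 3, 4] }, { keys: [10, 20] }] },
    { pivots: [99], children: [{ keys: [22, 25] }, { keys: [99] }] },
  ],
};

function insert(root, key) {
  let node = root;
  // Walk the internal nodes (the "path to data") using the >= pivot rule.
  while (node.children) {
    let i = 0;
    while (i < node.pivots.length && key >= node.pivots[i]) i++;
    node = node.children[i];
  }
  // Store the value in the leaf, keeping the leaf sorted.
  let pos = 0;
  while (pos < node.keys.length && node.keys[pos] < key) pos++;
  node.keys.splice(pos, 0, key);
  return node;
}

console.log(insert(tree, 15).keys); // the leaf that received 15
```

In an on-disk B-tree the leaf has to be read and written for every such insert, which is the cost the later slides focus on.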

B-tree - search

22
10 | 99
[2, 3, 4] [10, 20] [22, 25] [99]

“Find 25”

DISK

RAM

RAM

B-tree - storage

22
10 | 99
[2, 3, 4] [10, 20] [22, 25] [99]

Performance is IO-limited once the tree is bigger than RAM: try to fit all internal nodes and some leaf nodes in memory

DISK

RAM

RAM

B-tree – serial insertions

22
10 | 99
[2, 3, 4] [10, 20] [22, 25] [99]

Serial insertion workloads stay in memory because every new key lands in the right-most leaf: think MongoDB’s “_id” index
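The gap between serial and random insertions can be illustrated with a toy cache model. Everything here is an assumption made for the sketch (leaf size, cache size, key count, and the LRU policy); a cache miss stands in for one disk read of a leaf node.

```javascript
// Toy model (all sizes are made up): each leaf holds 100 keys and only
// 10 leaves fit in the RAM cache. A cache miss stands in for one disk read.
const KEYS_PER_LEAF = 100;
const CACHED_LEAVES = 10;
const TOTAL_KEYS = 100000;

function countLeafMisses(keys) {
  const cache = new Map(); // Map keeps insertion order, used here as an LRU
  let misses = 0;
  for (const key of keys) {
    const leafId = Math.floor(key / KEYS_PER_LEAF);
    if (cache.has(leafId)) {
      cache.delete(leafId); // re-inserted below to mark it most recently used
    } else {
      misses++;
      if (cache.size >= CACHED_LEAVES) {
        cache.delete(cache.keys().next().value); // evict least recently used
      }
    }
    cache.set(leafId, true);
  }
  return misses;
}

// Serial keys, like an ever-increasing _id.
const serialKeys = Array.from({ length: TOTAL_KEYS }, (_, i) => i);

// Pseudo-random keys from a small Park-Miller generator (deterministic).
let seed = 42;
const randomKeys = Array.from({ length: TOTAL_KEYS }, () => {
  seed = (seed * 48271) % 2147483647;
  return seed % TOTAL_KEYS;
});

const serialMisses = countLeafMisses(serialKeys);
const randomMisses = countLeafMisses(randomKeys);
console.log({ serialMisses, randomMisses });
```

In this model the serial workload misses once per leaf, while the random workload misses on almost every insert once the leaves no longer fit in the cache.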


Fractal Tree Indexes

Fractal Tree Indexes

Similar to B-trees
– store data in leaf nodes
– use PK for ordering

Different than B-trees
– message buffer in all internal nodes
– doesn’t need to update leaf node immediately
– much larger nodes (4MB vs. 8KB*)

All internal nodes have message buffers


Fractal Tree Indexes – “insert 15”

22
10 | 99
[2, 3, 4] [10, 20] [22, 25] [99]

insert(15)

No IO is required: all internal nodes usually fit in RAM
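A minimal sketch of the buffered insert, assuming a hypothetical node layout in which every internal node carries a `buffer` array. It only shows the idea that the message stays in the root and the leaf is left alone; it is not Tokutek's implementation.

```javascript
// Hypothetical node layout for illustration only. Every internal node
// carries a message buffer; leaves hold the keys and would live on disk.
const root = {
  pivots: [22],
  buffer: [], // pending messages, e.g. { op: "insert", key: 15 }
  children: [
    { pivots: [10], buffer: [], children: [{ keys: [2, 3, 4] }, { keys: [10, 20] }] },
    { pivots: [99], buffer: [], children: [{ keys: [22, 25] }, { keys: [99] }] },
  ],
};

// An insert only appends a message to the root's buffer. The leaf that will
// eventually hold the key is neither read nor written at this point.
function insert(tree, key) {
  tree.buffer.push({ op: "insert", key });
}

insert(root, 15);
console.log(root.buffer);                       // the pending insert(15) message
console.log(root.children[0].children[1].keys); // leaf is unchanged
```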


Fractal Tree Indexes – “find 25”

22
10 | 99
[2, 3, 4] [10] [22, 25] [99]

insert(15)

insert(20) insert(25)

delete(3)



22
10 | 99
[2, 3, 4] [10] [22, 25] [99]

insert(15)

Buffer is full, push messages down to next level.

insert(20) insert(25)

delete(3)



22
10 | 99
[2, 4, 8] [10, 20, 25] [22, 25] [99]

insert(15)

Inserted 8, 20, 25. Deleted 3.
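The “buffer is full, push messages down” step can be sketched as follows. The buffer capacity of 3, the node layout, and the function names are all assumptions for this sketch, and the message sequence is a simplified variant of the one on the slides.

```javascript
const BUFFER_CAPACITY = 3; // made-up limit; real buffers hold far more

const leaf = (keys) => ({ keys });
const internal = (pivots, children) => ({ pivots, children, buffer: [] });

const root = internal([22], [
  internal([10], [leaf([2, 3, 4]), leaf([10])]),
  internal([99], [leaf([22, 25]), leaf([99])]),
]);

// Pick the child responsible for `key` (pivot rule is >=).
function childFor(node, key) {
  let i = 0;
  while (i < node.pivots.length && key >= node.pivots[i]) i++;
  return node.children[i];
}

// Apply one message to a leaf, keeping its keys sorted.
function applyToLeaf(leafNode, msg) {
  const at = leafNode.keys.indexOf(msg.key);
  if (msg.op === "delete") {
    if (at !== -1) leafNode.keys.splice(at, 1);
  } else if (at === -1) {
    leafNode.keys.push(msg.key);
    leafNode.keys.sort((a, b) => a - b);
  }
}

// Hand a message to a node: leaves apply it, internal nodes buffer it and
// push the whole buffer down one level once it is full.
function deliver(node, msg) {
  if (!node.children) return applyToLeaf(node, msg);
  node.buffer.push(msg);
  if (node.buffer.length >= BUFFER_CAPACITY) {
    const pending = node.buffer;
    node.buffer = [];
    for (const m of pending) deliver(childFor(node, m.key), m);
  }
}

deliver(root, { op: "insert", key: 8 });
deliver(root, { op: "insert", key: 20 });
// The third message fills the root buffer. All three messages move to the
// left child, fill its buffer as well, and are applied to the leaves.
deliver(root, { op: "delete", key: 3 });
console.log(root.buffer.length);
console.log(root.children[0].children.map((l) => l.keys));
```

Many small changes reach a leaf in one batch, so the leaf is rewritten once rather than once per message.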


Fractal Tree Indexes – compression

•  Large node size (4MB) leads to high compression ratios.

•  Supports zlib, quicklz, and lzma compression algorithms.

•  Compression is generally 5x to 25x, similar to what gzip and 7z can do to your data.

•  Significantly less disk space needed

•  Fewer writes, bigger writes

•  Both of which are great for SSDs

•  Reads are highly compressed, so each IO returns more data
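The effect of block size on compression ratio can be checked with a rough experiment. This sketch uses Node.js's built-in `zlib` module and made-up sample records, and compresses 512KB rather than 4MB to keep it quick; it illustrates the direction of the effect only and says nothing about the ratios Tokutek reports.

```javascript
const zlib = require("zlib"); // Node.js built-in

// Build about 512KB of repetitive, document-like text (made-up sample data).
let text = "";
for (let i = 0; text.length < 512 * 1024; i++) {
  text += JSON.stringify({
    uri: "http://example.com/page/" + i,
    name: "user" + (i % 50),
    origin: "web",
  }) + "\n";
}
const data = Buffer.from(text);

// Compress the data in independent blocks of `blockSize` bytes and
// return the total compressed size.
function compressedSize(buf, blockSize) {
  let total = 0;
  for (let off = 0; off < buf.length; off += blockSize) {
    total += zlib.deflateSync(buf.subarray(off, off + blockSize)).length;
  }
  return total;
}

const small = compressedSize(data, 8 * 1024);    // many 8KB blocks
const large = compressedSize(data, data.length); // one large block
console.log({ raw: data.length, small, large });
```

Each independent block starts with no history to refer back to, so many small blocks compress worse in total than one large block of the same data.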


So what does this have to do with MongoDB?

* Watch Tyler Brock’s presentation “Indexing and Query Optimization”

MongoDB Storage

db.test.insert({foo:55})
db.test.ensureIndex({foo:1})

PK index (_id + pointer)

25
10 | 99
[(2,ptr2), (4,ptr4)] [(10,ptr10)] [(25,ptr25), (98,ptr98)] [(101,ptr101)]

Secondary Index (foo + pointer)

85
40 | 120
[(2,ptr10), (35,ptr101)] [(55,ptr4)] [(90,ptr2)] [(2599,ptr98)]

The “pointer” tells MongoDB where to look in the data files for the actual document data.
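A toy model of the two-step lookup, with an array standing in for the data files and Maps standing in for the two indexes. The names and layout are hypothetical and are not MongoDB internals.

```javascript
// Toy stand-ins: `dataFile` plays the role of MongoDB's data files, and the
// two Maps play the role of the indexes. Names and layout are hypothetical.
const dataFile = [];        // position in this array = "pointer"
const pkIndex = new Map();  // _id -> pointer
const fooIndex = new Map(); // foo -> pointer

function insert(doc) {
  const ptr = dataFile.push(doc) - 1; // append the document, remember where
  pkIndex.set(doc._id, ptr);
  fooIndex.set(doc.foo, ptr);
}

// A lookup by foo takes two steps: index -> pointer, then pointer -> document.
function findByFoo(foo) {
  const ptr = fooIndex.get(foo);
  return ptr === undefined ? null : dataFile[ptr];
}

insert({ _id: 4, foo: 55 });
insert({ _id: 2, foo: 90 });
console.log(findByFoo(55)); // the document whose _id is 4
```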

Both the PK index and the secondary index are B-trees.

Who is Tokutek and what have we done?

•  Tokutek’s Fractal Tree Index Implementations
•  MySQL Storage Engine (TokuDB)
•  BerkeleyDB API
•  File System (TokuFS)
•  Recently added Fractal Tree Indexes to MongoDB 2.2
•  Existing indexes are still supported
•  Source changes are available via our blog at www.tokutek.com/tokuview
•  This is a work in progress (see roadmap slides)

MongoDB and Fractal Tree Indexes

As simple as:

db.test.ensureIndex({foo:1}, {v:2})

Indexing Options #1 (blocksize)

db.test.ensureIndex({foo:1}, {v:2, blocksize:4194304, basementsize:131072, compression:"quicklz", clustering:false})

•  Node size, defaults to 4MB.

Indexing Options #2 (basementsize)

•  Basement node size, defaults to 128K.
•  Smallest retrievable unit of a leaf node, which keeps point queries efficient.

Indexing Options #3 (compression)

•  Compression algorithm, defaults to quicklz.
•  Supports quicklz, lzma, zlib, and none.
•  LZMA provides 40% additional compression beyond quicklz, but needs more CPU.
•  Decompression of quicklz and lzma is similar.

Indexing Options #4 (clustering)

•  Clustering indexes store data by key and include the entire document as the payload (rather than a pointer to the document).
•  Always “cover” a query, no need to retrieve the document data.
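The difference between a pointer payload and a clustering payload can be sketched with two Maps. This is a hypothetical in-memory model; `heapReads` simply counts how often the document store has to be consulted.

```javascript
// Hypothetical in-memory model; `heapReads` counts trips to the data files.
const documents = [
  { _id: 1, foo: 55, name: "a" },
  { _id: 2, foo: 90, name: "b" },
];
let heapReads = 0;
const fetchFromHeap = (ptr) => {
  heapReads++;
  return documents[ptr];
};

// Non-clustering index: key -> pointer into the data files.
const pointerIndex = new Map(documents.map((d, ptr) => [d.foo, ptr]));
// Clustering index: key -> the entire document as the payload.
const clusteringIndex = new Map(documents.map((d) => [d.foo, d]));

const viaPointer = fetchFromHeap(pointerIndex.get(90)); // index, then heap
const viaClustering = clusteringIndex.get(90);          // index only

console.log(viaPointer.name, viaClustering.name, heapReads);
```

Both paths return the same document, but only the pointer path pays for the extra read.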

How well does it perform?

Three Benchmarks
•  Benchmark 1 : Raw insertion performance
•  Benchmark 2 : Insertion plus queries
•  Benchmark 3 : Covered indexes vs. clustering indexes

Benchmarks…

Race Results
•  First Place = John
•  Second Place = Tim
•  Third Place = Frank

Frank can say the following: “I finished third, but Tim was second to last.”

Understand benchmark specifics and review all results.

Benchmark 1 : Overview

•  Measure single threaded insertion performance
•  Document is URI (character), name (character), origin (character), creation date (timestamp), and expiration date (timestamp)
•  Secondary indexes on URI, name, origin, expiration
•  Machine specifics:
– Sun x4150, (2) Xeon 5460, 8GB RAM, StorageTek Controller (256MB, write-back), 4x10K SAS/RAID 0
– Ubuntu 10.04 Server (64-bit), ext4 filesystem
– MongoDB v2.2.RC0

Benchmark 1 : Without Journaling

Benchmark 1 : With Journaling

Benchmark 1 : Observations

•  Fractal Tree Indexing insertion performance is 8x better than standard MongoDB indexing with journaling, and 11x without journaling

•  Fractal Tree Indexing insertion performance reaches steady state, even at 200 million insertions. MongoDB insertion performance seems to be in continual decline at only 50 million insertions

•  B-tree performance is great until the working data set > RAM

Benchmark 2 : Overview

•  Measure single threaded insertion performance while querying for 1000 documents with a URI greater than or equal to a randomly selected value once every 60 seconds
•  Document is same as benchmark 1
•  Secondary indexes on URI, name, origin, expiration
•  Fractal Tree Index on URI is clustering
– clustering indexes store entire document inline
– compression controls disk usage
– no need to get document data from elsewhere
– db.tokubench.ensureIndex({URI:1}, {v:2, clustering:true})
•  Same hardware as benchmark 1

Benchmark 2 : Insertion Performance

Benchmark 2 : Query Latency

Benchmark 2 : Observations

•  Fractal Tree Indexing insertion performance is 10x better than standard MongoDB indexing

•  Fractal Tree Indexing query latency is 268x better than standard MongoDB indexing

•  Random lookups are bad

...but what about MongoDB’s covered indexes?

Benchmark 3 : Overview

•  Same workload and hardware as benchmark 2
•  Create a MongoDB covered index on URI to eliminate lookups in the data files.
– db.tokubench.ensureIndex({URI:1,creation:1,name:1,origin:1})

Benchmark 3 : Insertion Performance

Benchmark 3 : Query Latency

Benchmark 3 : Observations

•  Fractal Tree Indexing insertion performance is still 3.7x better than standard MongoDB indexing

•  Fractal Tree Indexing query latency is 3.2x better than standard MongoDB indexing (although the MongoDB performance is highly variable)

•  MongoDB’s covered indexes can help a lot
– But what happens when I add new fields to my document?
o Do I drop and re-create the index to include my new field?
o Do I live without it?
– Clustered Fractal Tree Indexes keep on covering your queries!
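The “new field” problem can be made concrete with a simplified coverage check. The function `isCovered` and the extra field `owner` are hypothetical; this only models the rule that every field a query touches must be part of the index key.

```javascript
// A query is "covered" when every field it filters on or returns is part of
// the index key, so the data files are never read. Simplified check only.
function isCovered(indexFields, queryFields, returnedFields) {
  const inIndex = new Set(indexFields);
  return [...queryFields, ...returnedFields].every((f) => inIndex.has(f));
}

// Key fields of the covered index from benchmark 3.
const coveredIndex = ["URI", "creation", "name", "origin"];

console.log(isCovered(coveredIndex, ["URI"], ["URI", "name", "origin"])); // true
// After adding a new field (hypothetical "owner") to the documents, a query
// that returns it is no longer covered by the existing index.
console.log(isCovered(coveredIndex, ["URI"], ["URI", "name", "owner"]));  // false
```

A clustering index avoids the question because its payload is the whole document, whatever fields it holds.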

Roadmap : Continuing the Implementation

•  Optimize Indexing Insert/Update/Delete Operations
– Each of our secondary indexes is currently creating and committing a transaction for each operation
– A single transaction envelope will improve performance

•  Add Support for Parallel Array Indexes
– MongoDB does not support a compound index on the following two array fields:
o {a: [1, 2], b: [1, 2]}
– “it could get out of hand”
– Ticketed on 3/24/2010, jira.mongodb.org/browse/SERVER-826
– Benchmark coming soon…
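One way to see why this “could get out of hand”: a compound index over two arrays would need an entry for every combination of elements. The sketch below only counts those combinations; it is an illustration, not how MongoDB or Tokutek builds index keys.

```javascript
// One index entry per combination of array elements (illustrative model of
// why a compound index over two arrays grows multiplicatively).
function compoundKeys(a, b) {
  const keys = [];
  for (const x of a) for (const y of b) keys.push([x, y]);
  return keys;
}

console.log(compoundKeys([1, 2], [1, 2]).length);   // 2 x 2 = 4 entries
const hundred = Array.from({ length: 100 }, (_, i) => i);
console.log(compoundKeys(hundred, hundred).length); // 100 x 100 = 10000 entries
```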

•  Add Crash Safety
– Our implementation is not [yet] crash safe with the MongoDB PK/heap storage mechanism.
– MongoDB journal is separate from Fractal Tree Index logs.
– Need to create a transactional envelope around both of them

•  Replace MongoDB data store and PK index
– A clustering index on _id eliminates the need for two storage systems
– Compression greatly reduces disk footprint
– This is a large task


We are looking for evaluators!

Email me at [email protected]

See me after the presentation


Questions?

Tim Callaghan [email protected]

@tmcallaghan

More detailed benchmark information in my blogs at

www.tokutek.com/tokuview
