Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016

Myths of big partitions

Robert Stupp
Solution Architect @ DataStax, C*-Committer
@snazy

Issues with big partitions before 3.6

• Slow reads
• Compaction failures
• Repair failures
• java.lang.OutOfMemoryError

Fail fast, node down (lots of org.apache.cassandra.io.sstable.IndexInfo on heap)

SSTable Components

• Bloom Filter: determine whether an SSTable contains a partition (bloomFilterFpChance)
• Summary: partition samples (minIndexInterval / maxIndexInterval)
• Primary Index: all partition keys + index samples (column_index_size_in_kb)
• Data: all the data

Read from an SSTable

• Bloom Filter
  1. Check whether partition is in SSTable
• Summary
  2. Find “nearest” partition key
  3. Return offset in primary index
• Primary Index
  4. Find partition
  5. Find clustering key
  6. Return offset in data file
• Data
  7. Find, read and return data
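The seven read steps above can be sketched with a toy in-memory model. This is only an illustration under simplifying assumptions: the class `ToySSTable`, its fields and the `summary_interval` parameter are hypothetical, a plain set stands in for the Bloom filter, and none of this reflects Cassandra's real file formats.

```python
import bisect


class ToySSTable:
    """Toy model of the four components; not Cassandra's real formats."""

    def __init__(self, partitions, summary_interval=2):
        # partitions: {partition_key: {clustering_key: value}}
        keys = sorted(partitions)
        # "Data file": a flat list of (partition_key, clustering_key, value) rows.
        self.data = []
        # "Primary index": one entry per partition with the data-file offset
        # of each of its clustering keys.
        self.primary_index = []
        for pk in keys:
            offsets = {}
            for ck in sorted(partitions[pk]):
                offsets[ck] = len(self.data)
                self.data.append((pk, ck, partitions[pk][ck]))
            self.primary_index.append((pk, offsets))
        # "Summary": every n-th partition key and its position in the primary index.
        self.summary = [(keys[i], i) for i in range(0, len(keys), summary_interval)]
        # "Bloom filter": a plain set stands in here (so no false positives).
        self.bloom = set(keys)

    def read(self, pk, ck):
        # 1. Check whether the partition is in the SSTable.
        if pk not in self.bloom:
            return None
        # 2. Find the "nearest" sampled partition key at or before pk.
        sample_keys = [k for k, _ in self.summary]
        i = bisect.bisect_right(sample_keys, pk) - 1
        # 3. Take the sample's offset into the primary index.
        pos = self.summary[i][1]
        # 4. Find the partition by scanning forward from the sample.
        while self.primary_index[pos][0] != pk:
            pos += 1
        # 5. + 6. Find the clustering key and its offset in the data file.
        offset = self.primary_index[pos][1].get(ck)
        if offset is None:
            return None
        # 7. Read and return the data.
        return self.data[offset][2]


table = ToySSTable({"a": {1: "x", 2: "y"}, "b": {1: "z"}, "c": {5: "w"}})
print(table.read("b", 1))  # prints "z"
```

The point of the sketch is the order of lookups: each component narrows the search before the next, larger one is touched.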

Before CASSANDRA-11206

Evaluation of SSTable Components

• Bloom Filter: off-heap, small (fine)
• Summary: off-heap, small-ish (fine)
• Primary Index: on-heap, many small objects, nested structure (problematic)
• Data: for CQL since #8099 (fine)

Primary Index File Layout

(Diagram: the primary index file as a sequence of entries, each a "Partition Key" followed by its "Partition Index Samples"; positions in the file come "from" the Summary.)

Sampling the Primary Index

(Diagram: a partition in the data file split into blocks of column_index_size_in_kb (default: 64kB). The index entry holds the partition key and the offset in the SSTable data file; each index sample holds the FirstKey and LastKey of one block.)

How it looks on-heap

IndexedEntry: partitionKey*, offset, deletionInfo
  IndexInfo: firstKey, lastKey, offset, width, deletionInfo
  IndexInfo: firstKey, lastKey, offset, width, deletionInfo

* = technically not in IndexedEntry

Primary Index Structure

IndexedEntry extends RowIndexEntry
  DeletionTime
  ArrayList
    IndexInfo (per 64kB)
      DeletionTime
      BufferClustering
        Kind
        ByteBuffer[]
          ByteBuffer
            byte[]
          …
      BufferClustering
        Kind
        ByteBuffer[]
          ByteBuffer
            byte[]
          …

# of Java objects:

• IndexedEntry: 4
• IndexInfo (per 64kB): 8 + 4 * clust-key-components

(primitive fields omitted)

Primary Index - some numbers

Approximation on one 16 byte clustering-value:

Partition Size   Index Size (heap)   # of objects
1MB              3kB                 > 200 objects
4MB              11kB                > 800 objects
64MB             180kB               > 13,000 objects
512MB            1.4MB               > 106,000 objects
2048MB           5.6MB               > 424,000 objects

Disclaimer: numbers are examples and not representative

SELECT foo, bar FROM big_partition_table WHERE ...

• Reads IndexedEntry w/ all IndexInfo
• 2GB partition means: 32,768 IndexInfo, 424,000 objects
• Binary search just needs: 15 IndexInfo (max), O(log n), ~200 objects
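The 2GB figures above follow from simple arithmetic: one IndexInfo per 64kB block, and a binary search that halves the candidate range each step. A small sketch to reproduce them; the function names are made up for illustration, and the 64kB default is the column_index_size_in_kb value quoted in the slides.

```python
def index_info_count(partition_bytes, column_index_size_kb=64):
    """One index sample per started block of column_index_size_kb."""
    block = column_index_size_kb * 1024
    return -(-partition_bytes // block)  # integer ceiling division


def halving_steps(n):
    """How often a range of n samples must be halved to reach one sample."""
    return (n - 1).bit_length() if n > 0 else 0


n = index_info_count(2 * 1024**3)  # 2GB partition
print(n, halving_steps(n))  # prints "32768 15"
```

So only a tiny fraction of the 32,768 deserialized IndexInfo objects is actually needed to answer a single-row read.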

Writes – Flushes & Compactions

IndexedEntry constructed with all IndexInfo as Java object structure on heap first, then serialized to disk

Compacting a 2GB partition

(Diagram: four SSTables compacted into one SSTable, next to the KeyCache. Labels: "106,000 objects", "Remove 106,000 objects", "Add 424,000 objects", "Construct 424,000 objects".)

Reads of big partitions – on heap

• Primary index data deserialized
• Object structure added to key cache
• Other entries evicted from key cache
• Also applies to compaction & repair

Flushes with big partitions – on heap

• Primary index data constructed
• Object structure added to key cache (for compactions)
• Also applies to compactions

Trivia: How many 2GB partitions fit in the key cache?

• 2GB partition: 5.6MB
• 100MB / 6MB ≈ 16

Issues w/ big partitions – TL;DR

• Number of Java objects
• Additions and evictions to/from key cache

Necessities – TL;DR

• Reduce number of Java objects
• Reduce GC pressure
• No change in sstable format, i.e. files need to be binary compatible

Approach

• Omit (most) IndexInfo on heap

• Read IndexInfo only when needed
• Serialize primary index via byte buffer
• Objects “never” promoted to Java old gen (hope so ;) )
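To illustrate the idea of reading IndexInfo only when needed, here is a rough sketch that binary-searches samples directly in a serialized buffer and decodes only the records it probes. The fixed-width record layout and the names `RECORD`, `serialize` and `find_block` are assumptions made for this example; Cassandra's real IndexInfo has variable-length clustering keys and a different serialization.

```python
import struct

# Hypothetical fixed-width record: first key, last key, offset, width.
RECORD = struct.Struct(">QQQQ")


def serialize(samples):
    """Pack (first_key, last_key, offset, width) tuples into one bytes object."""
    return b"".join(RECORD.pack(*s) for s in samples)


def find_block(buf, key):
    """Binary-search the serialized samples, decoding only the probed records.

    Returns ((offset, width) or None, number of records decoded).
    """
    lo, hi = 0, len(buf) // RECORD.size - 1
    decoded = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        first, last, offset, width = RECORD.unpack_from(buf, mid * RECORD.size)
        decoded += 1
        if key < first:
            hi = mid - 1
        elif key > last:
            lo = mid + 1
        else:
            return (offset, width), decoded
    return None, decoded


# 32,768 samples, as for a 2GB partition with 64kB blocks.
samples = [(i * 100, i * 100 + 99, i * 65536, 65536) for i in range(32768)]
buf = serialize(samples)
block, decoded = find_block(buf, 1234567)
print(block, decoded <= 16)  # prints "(809041920, 65536) True"
```

Only the handful of probed records ever become Python objects here; the rest stay as bytes in the buffer, which is the effect the approach aims for on the Java heap.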

Small heap (3GB) test

• Test: org.apache.cassandra.io.sstable.LargePartitionsTest
• Before #11206 – duration: 3h, lots of GC, exhausted heap (java.lang.OutOfMemoryError)
• With #11206 – duration: 1h10, few GC, moderate heap usage

Results

• Promising!

• But: performance regression w/ some workloads

Better Approach

• Keep IndexInfo objects for “nicely” sized partitions on-heap

• Controlled via cassandra.yaml
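A minimal sketch of that decision, assuming a size threshold on the serialized index entry. To my knowledge the cassandra.yaml option is column_index_cache_size_in_kb with a default of 2, but treat the name and default as assumptions here; the function `keep_on_heap` is purely illustrative.

```python
# Assumed default; this sketch does not read cassandra.yaml.
COLUMN_INDEX_CACHE_SIZE_KB = 2


def keep_on_heap(serialized_index_bytes, threshold_kb=COLUMN_INDEX_CACHE_SIZE_KB):
    """True if the index entry is small enough to keep its IndexInfo objects on heap."""
    return serialized_index_bytes <= threshold_kb * 1024


print(keep_on_heap(1024))      # small partition index: prints "True"
print(keep_on_heap(3 * 1024))  # e.g. the ~3kB index of a 1MB partition: prints "False"
```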

Doesn’t this mean more disk I/O?

• “Hot” data already in buffer cache
• No change for “cold” partitions

#11206 Benefits

• Reduced heap usage
• Reduced GC pressure
• Improved read and write paths
• Key cache can hold “more” entries
• Moved the bad partition size “barrier”

#11206 Metrics

org.apache.cassandra.metrics: type=Index,scope=RowIndexEntry

• name=IndexInfoCountHistogram - # of IndexInfo per IndexedEntry

• name=IndexInfoGetsHistogram - # of “gets” against single IndexedEntry

• name=IndexedEntrySizeHistogram - serialized size of IndexedEntry

“After #11206, what’s the recommended partition size?”

• It still depends – sorry
• IMO we moved the “barrier”

Test with your data model and workload

Bad usage of large partitions

• CQL SELECT without clustering key
• i.e. materialize a large partition in memory
• Using the same partition key over a long time
• i.e. access many sstables

• Changes on-disk primary index format
• Efficient on-disk representation
• Optimized for OS page size
• WIP!
• Fix-Version: 4.x

Thank You!
Q & A

Come to the “experts stand”
