Cassandra And Spark Dataframes Russell Spitzer Software Engineer @ Datastax

Spark Cassandra Connector Dataframes


Page 1: Spark Cassandra Connector Dataframes

Cassandra And Spark Dataframes

Russell Spitzer Software Engineer @ Datastax

Page 6: Spark Cassandra Connector Dataframes

Tungsten Gives DataFrames Off-Heap Power!

Memory can be compared off-heap and bitwise! Code generation!

Page 7: Spark Cassandra Connector Dataframes

The Core is the Cassandra Source

https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra

/** Implements [[BaseRelation]], [[InsertableRelation]] and [[PrunedFilteredScan]].
 *  It inserts data to and scans a Cassandra table. If filterPushdown is true,
 *  it pushes down some filters to CQL.
 */

[Diagram: a DataFrame backed by the source org.apache.spark.sql.cassandra]

Page 8: Spark Cassandra Connector Dataframes

The Core is the Cassandra Source

https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra

/** Implements [[BaseRelation]], [[InsertableRelation]] and [[PrunedFilteredScan]].
 *  It inserts data to and scans a Cassandra table. If filterPushdown is true,
 *  it pushes down some filters to CQL.
 */

[Diagram: DataFrame -> CassandraSourceRelation -> CassandraTableScanRDD, plus Configuration]

Page 9: Spark Cassandra Connector Dataframes

Configuration Can Be Done on a Per Source Level

Property names follow clusterName:keyspaceName/propertyName. Example: changing cluster/keyspace level properties:

val conf = new SparkConf()
  .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
  .set("default:test/spark.cassandra.input.split.size_in_mb", "128")

val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "ClusterOne"))
  .load()

Page 10: Spark Cassandra Connector Dataframes

Configuration Can Be Done on a Per Source Level

Property names follow clusterName:keyspaceName/propertyName. Example: changing cluster/keyspace level properties:

val conf = new SparkConf()
  .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
  .set("default:test/spark.cassandra.input.split.size_in_mb", "128")

val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "ClusterOne"))
  .load()

Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32

Page 11: Spark Cassandra Connector Dataframes

Configuration Can Be Done on a Per Source Level

Property names follow clusterName:keyspaceName/propertyName. Example: changing cluster/keyspace level properties:

val conf = new SparkConf()
  .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
  .set("default:test/spark.cassandra.input.split.size_in_mb", "128")

val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "ClusterOne"))
  .load()

Namespace: default Keyspace: test

spark.cassandra.input.split.size_in_mb=128

Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32

Page 12: Spark Cassandra Connector Dataframes

Configuration Can Be Done on a Per Source Level

Property names follow clusterName:keyspaceName/propertyName. Example: changing cluster/keyspace level properties:

val conf = new SparkConf()
  .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
  .set("default:test/spark.cassandra.input.split.size_in_mb", "128")

val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "ClusterOne"))
  .load()

Namespace: default Keyspace: test

spark.cassandra.input.split.size_in_mb=128

Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32

Page 13: Spark Cassandra Connector Dataframes

Configuration Can Be Done on a Per Source Level

Property names follow clusterName:keyspaceName/propertyName. Example: changing cluster/keyspace level properties:

val conf = new SparkConf()
  .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
  .set("default:test/spark.cassandra.input.split.size_in_mb", "128")

val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "default"))
  .load()

Namespace: default Keyspace: test

spark.cassandra.input.split.size_in_mb=128

Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32

Page 14: Spark Cassandra Connector Dataframes

Configuration Can Be Done on a Per Source Level

Property names follow clusterName:keyspaceName/propertyName. Example: changing cluster/keyspace level properties:

val conf = new SparkConf()
  .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
  .set("default:test/spark.cassandra.input.split.size_in_mb", "128")

val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "other",
    "cluster" -> "default"))
  .load()

Namespace: default Keyspace: test

spark.cassandra.input.split.size_in_mb=128

Namespace: ClusterOne spark.cassandra.input.split.size_in_mb=32

Connector Default

Page 15: Spark Cassandra Connector Dataframes

Predicate Pushdown Is Automatic!

SELECT * FROM cassandraTable WHERE clusteringKey > 100

Page 16: Spark Cassandra Connector Dataframes

Predicate Pushdown Is Automatic!

SELECT * FROM cassandraTable WHERE clusteringKey > 100

[Diagram: logical plan: DataFromC* -> Filter(clusteringKey > 100) -> Show]

Page 17: Spark Cassandra Connector Dataframes

Predicate Pushdown Is Automatic!

SELECT * FROM cassandraTable WHERE clusteringKey > 100

[Diagram: Catalyst optimizes the plan: DataFromC* -> Filter(clusteringKey > 100) -> Show]

Page 18: Spark Cassandra Connector Dataframes

Predicate Pushdown Is Automatic!

SELECT * FROM cassandraTable WHERE clusteringKey > 100

[Diagram: Catalyst optimizes the plan: DataFromC* -> Filter(clusteringKey > 100) -> Show]

https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala

Page 19: Spark Cassandra Connector Dataframes

Predicate Pushdown Is Automatic!

SELECT * FROM cassandraTable WHERE clusteringKey > 100

[Diagram: Catalyst pushes the filter into the Cassandra scan, adding the WHERE clause "clusteringKey > 100" to the CQL query]

https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala

Page 20: Spark Cassandra Connector Dataframes

What can be pushed down?

1. Non-partition key column predicates are pushed down only with =, >, <, >=, <= predicates.
2. Primary key column predicates are pushed down only with = or IN predicates.
3. If there are regular columns in the pushdown predicates, they must include at least one EQ expression on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
5. For clustering column predicates, only the last predicate can be a non-EQ predicate (including IN), and the preceding column predicates must be EQ predicates. If there is only one clustering column predicate, it can be any non-IN predicate.
6. No predicates are pushed down if there is any OR condition or NOT IN condition.
7. Multiple predicates on the same column cannot be pushed down if any of them is an equality or IN predicate.

Page 21: Spark Cassandra Connector Dataframes

What can be pushed down?

If you could write it in CQL, it will get pushed down.
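
A minimal sketch of what that means in practice (assuming a hypothetical test.words table whose clustering column is count): predicates expressible in CQL are pushed into the Cassandra scan, while everything else is evaluated by Spark after the rows are read. explain() shows where each filter runs.

val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .load()

// Pushed down: a range predicate on a clustering column is valid CQL
df.filter("count > 100").explain()

// Not pushed down: CQL has no LIKE, so Spark applies this filter itself
df.filter("word LIKE 'c%'").explain()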

Page 22: Spark Cassandra Connector Dataframes

What are we Pushing Down To?

CassandraTableScanRDD

All of the underlying code is the same as with sc.cassandraTable, so everything about reading and writing applies.
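
For reference, a sketch of the equivalent RDD API (again assuming a hypothetical test.words table):

import com.datastax.spark.connector._

val rdd = sc.cassandraTable("test", "words")
  .select("word", "count")
  .where("count > ?", 100)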

Page 23: Spark Cassandra Connector Dataframes

What are we Pushing Down To?

CassandraTableScanRDD

All of the underlying code is the same as with sc.cassandraTable, so everything about reading and writing applies.

https://academy.datastax.com/ Watch me talk about this in the privacy of your own home!

Page 24: Spark Cassandra Connector Dataframes

How the Spark Cassandra Connector Reads Data

Page 25: Spark Cassandra Connector Dataframes

Spark RDDs Represent a Large Amount of Data Partitioned into Chunks

[Diagram: an RDD of partitions 1-9, alongside Nodes 1-4]

Page 26: Spark Cassandra Connector Dataframes

Spark RDDs Represent a Large Amount of Data Partitioned into Chunks

[Diagram: partitions 1-9 distributed across Nodes 1-4]

Page 27: Spark Cassandra Connector Dataframes

Spark RDDs Represent a Large Amount of Data Partitioned into Chunks

[Diagram: partitions 1-9 distributed across Nodes 1-4]

Page 28: Spark Cassandra Connector Dataframes

Cassandra Data is Distributed By Token Range

Page 29: Spark Cassandra Connector Dataframes

Cassandra Data is Distributed By Token Range

[Diagram: token ring with positions 0 and 500 marked]

Page 30: Spark Cassandra Connector Dataframes

Cassandra Data is Distributed By Token Range

[Diagram: token ring with positions 0, 500, and 999 marked]

Page 31: Spark Cassandra Connector Dataframes

Cassandra Data is Distributed By Token Range

[Diagram: token ring divided among Nodes 1-4]

Page 32: Spark Cassandra Connector Dataframes

Cassandra Data is Distributed By Token Range

[Diagram: token ring divided among Nodes 1-4, without vnodes]

Page 33: Spark Cassandra Connector Dataframes

Cassandra Data is Distributed By Token Range

[Diagram: token ring divided among Nodes 1-4, with vnodes]

Page 34: Spark Cassandra Connector Dataframes

The Connector Uses Information on the Node to Make Spark Partitions

[Diagram: Node 1 owns token ranges 0-50, 120-220, 300-500, 780-830]

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB

With a 1 MB split size and a reported density of 100 tokens per MB, each Spark partition should cover roughly 100 tokens, so larger ranges are split and smaller ones are grouped together.

Page 35: Spark Cassandra Connector Dataframes

The Connector Uses Information on the Node to Make Spark Partitions

[Diagram: Spark partition 1 being formed; token ranges 0-50, 120-220, 300-500, 780-830]

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB

Page 36: Spark Cassandra Connector Dataframes

The Connector Uses Information on the Node to Make Spark Partitions

[Diagram: Spark partition 1 being formed; token ranges 0-50, 120-220, 300-500, 780-830]

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB

Page 37: Spark Cassandra Connector Dataframes

The Connector Uses Information on the Node to Make Spark Partitions

[Diagram: Spark partitions 1 and 2; remaining token ranges 0-50, 300-500, 780-830]

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB

Page 38: Spark Cassandra Connector Dataframes

The Connector Uses Information on the Node to Make Spark Partitions

[Diagram: Spark partitions 1 and 2; remaining token ranges 0-50, 300-500, 780-830]

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB

Page 39: Spark Cassandra Connector Dataframes

The Connector Uses Information on the Node to Make Spark Partitions

[Diagram: Spark partitions 1 and 2; the 300-500 range split into 300-400 and 400-500; remaining ranges 0-50, 780-830]

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB

Page 40: Spark Cassandra Connector Dataframes

The Connector Uses Information on the Node to Make Spark Partitions

[Diagram: Spark partitions 1 and 2; remaining token ranges 0-50, 400-500, 780-830]

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB

Page 41: Spark Cassandra Connector Dataframes

The Connector Uses Information on the Node to Make Spark Partitions

[Diagram: Spark partitions 1, 2, and 3; remaining token ranges 0-50, 400-500, 780-830]

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB

Page 42: Spark Cassandra Connector Dataframes

The Connector Uses Information on the Node to Make Spark Partitions

[Diagram: Spark partitions 1, 2, and 3; remaining token ranges 0-50, 400-500, 780-830]

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB

Page 43: Spark Cassandra Connector Dataframes

The Connector Uses Information on the Node to Make Spark Partitions

[Diagram: Spark partitions 1, 2, and 3; remaining token ranges 0-50, 780-830]

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB

Page 44: Spark Cassandra Connector Dataframes

The Connector Uses Information on the Node to Make Spark Partitions

[Diagram: Spark partitions 1, 2, 3, and 4; remaining token ranges 0-50, 780-830]

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB

Page 45: Spark Cassandra Connector Dataframes

The Connector Uses Information on the Node to Make Spark Partitions

[Diagram: Spark partitions 1, 2, 3, and 4; remaining token ranges 0-50, 780-830]

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB

Page 46: Spark Cassandra Connector Dataframes

The Connector Uses Information on the Node to Make Spark Partitions

[Diagram: all of Node 1's token ranges grouped into Spark partitions 1-4]

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB
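
A quick way to see the result of this grouping (a sketch, assuming the hypothetical test.words table):

import com.datastax.spark.connector._

// Each group of token ranges above becomes one Spark partition
val rdd = sc.cassandraTable("test", "words")
println(rdd.partitions.length)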

Page 47: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 (token ranges 0-50 and 780-830) on Node 1]

Page 48: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 (token ranges 0-50 and 780-830) on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50
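
For illustration, a sketch of issuing one of these token-range queries by hand through the connector's session pool (keyspace.table and pk are the slide's placeholder names):

import com.datastax.spark.connector.cql.CassandraConnector

CassandraConnector(sc.getConf).withSessionDo { session =>
  // Results stream back in pages of spark.cassandra.input.page.row.size rows
  session.execute(
    "SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50")
}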

Page 49: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 (token ranges 0-50 and 780-830) on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

Page 50: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 (token ranges 0-50 and 780-830) on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: one page of 50 CQL rows returned]

Page 51: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 (token ranges 0-50 and 780-830) on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: one page of 50 CQL rows returned]

Page 52: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 (token ranges 0-50 and 780-830) on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: two pages of 50 CQL rows returned]

Page 53: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 (token ranges 0-50 and 780-830) on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: two pages of 50 CQL rows returned]

Page 54: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 (token ranges 0-50 and 780-830) on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: three pages of 50 CQL rows returned]

Page 55: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 (token ranges 0-50 and 780-830) on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: three pages of 50 CQL rows returned]

Page 56: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 (token ranges 0-50 and 780-830) on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: four pages of 50 CQL rows returned]

Page 57: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 (token ranges 0-50 and 780-830) on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: four pages of 50 CQL rows returned]

Page 58: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 (token ranges 0-50 and 780-830) on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: five pages of 50 CQL rows returned]

Page 59: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 (token ranges 0-50 and 780-830) on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: five pages of 50 CQL rows returned]

Page 60: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 on Node 1; the 780-830 range is finished]

SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: five pages of 50 CQL rows returned so far]

Page 61: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 on Node 1; the 780-830 range is finished]

SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: six pages of 50 CQL rows returned so far]

Page 62: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 on Node 1; the 780-830 range is finished]

SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: six pages of 50 CQL rows returned so far]

Page 63: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: ten pages of 50 CQL rows returned in total]

Page 64: Spark Cassandra Connector Dataframes

Data is Retrieved Using the DataStax Java Driver

spark.cassandra.input.page.row.size = 50

[Diagram: Spark partition 4 on Node 1]

SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: ten pages of 50 CQL rows returned in total]

Page 65: Spark Cassandra Connector Dataframes

How the Spark Cassandra Connector Writes Data

Page 66: Spark Cassandra Connector Dataframes

Spark RDDs Represent a Large Amount of Data Partitioned into Chunks

[Diagram: an RDD of partitions 1-9, alongside Nodes 1-4]

Page 67: Spark Cassandra Connector Dataframes

Spark RDDs Represent a Large Amount of Data Partitioned into Chunks

[Diagram: partitions 1-9 distributed across Nodes 1-4]

Page 68: Spark Cassandra Connector Dataframes

[Diagram: partitions 1-9 distributed across Nodes 1-4]

The Spark Cassandra Connector's saveToCassandra method can be called on almost all RDDs.

rdd.saveToCassandra("Keyspace","Table")
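
A minimal runnable sketch (assuming a hypothetical test.words table with columns word and count):

import com.datastax.spark.connector._

val rows = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
rows.saveToCassandra("test", "words", SomeColumns("word", "count"))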

Page 69: Spark Cassandra Connector Dataframes

A Java Driver connection is made to the local node and a prepared statement is built for the target table.

[Diagram: Spark partition 1 on Node 1, connected to the Java Driver]

Page 70: Spark Cassandra Connector Dataframes

Batches are built from data in Spark partitions.

[Diagram: rows such as 1,1,1 / 1,2,1 / 2,1,1 / 3,8,1 / 1,4,1 / 5,4,1 flowing from the Spark partition into the Java Driver]

Page 71: Spark Cassandra Connector Dataframes

By default these batches only contain CQL rows which share the same partition key.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: rows from the Spark partition buffered in the Java Driver, not yet batched]

Page 72: Spark Cassandra Connector Dataframes

By default these batches only contain CQL rows which share the same partition key.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batch PK=1 started with rows 1,1,1 and 1,2,1]

Page 73: Spark Cassandra Connector Dataframes

When an element is not part of an existing batch, a new batch is started.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batch PK=1 holding rows 1,1,1 and 1,2,1]

Page 74: Spark Cassandra Connector Dataframes

When an element is not part of an existing batch, a new batch is started.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batches PK=1 and PK=2 being built]

Page 75: Spark Cassandra Connector Dataframes

When an element is not part of an existing batch, a new batch is started.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batches PK=1 and PK=2 being built]

Page 76: Spark Cassandra Connector Dataframes

If a batch reaches batch.size.rows or batch.size.bytes, it is executed by the driver.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batch PK=3 filling with rows 3,2,1 / 3,4,1 / 3,5,1 / 3,8,1 alongside batches PK=1 and PK=2]

Page 77: Spark Cassandra Connector Dataframes

If a batch reaches batch.size.rows or batch.size.bytes, it is executed by the driver.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batch PK=3 filling with rows 3,2,1 / 3,4,1 / 3,5,1 / 3,8,1 alongside batches PK=1 and PK=2]

Page 78: Spark Cassandra Connector Dataframes

If a batch reaches batch.size.rows or batch.size.bytes, it is executed by the driver.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batch PK=3 executed; batches PK=1 and PK=2 remain]

Page 79: Spark Cassandra Connector Dataframes

If a batch reaches batch.size.rows or batch.size.bytes, it is executed by the driver.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: a new batch PK=3 started alongside PK=1 and PK=2]

Page 80: Spark Cassandra Connector Dataframes

If more than batch.buffer.size batches are currently being made, the largest batch is executed by the Java Driver.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batches PK=1, PK=2, and PK=3 being built]

Page 81: Spark Cassandra Connector Dataframes

If more than batch.buffer.size batches are currently being made, the largest batch is executed by the Java Driver.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: the largest batch executed; batches PK=2 and PK=3 remain]

Page 82: Spark Cassandra Connector Dataframes

If more than batch.buffer.size batches are currently being made, the largest batch is executed by the Java Driver.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batches PK=2, PK=3, and PK=5 being built]

Page 83: Spark Cassandra Connector Dataframes

If more than batch.buffer.size batches are currently being made, the largest batch is executed by the Java Driver.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batches PK=2, PK=3, and PK=5 being built]

Page 84: Spark Cassandra Connector Dataframes

If more batches are currently being executed by the Java Driver than concurrent.writes, we wait until one of the requests has completed.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batches PK=2, PK=3, and PK=5 in flight]

Page 85: Spark Cassandra Connector Dataframes

If more batches are currently being executed by the Java Driver than concurrent.writes, we wait until one of the requests has completed.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: the write for batch PK=2 is acknowledged]

Page 86: Spark Cassandra Connector Dataframes

If more batches are currently being executed by the Java Driver than concurrent.writes, we wait until one of the requests has completed.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batches PK=2, PK=3, and PK=5]

Page 87: Spark Cassandra Connector Dataframes

If more batches are currently being executed by the Java Driver than concurrent.writes, we wait until one of the requests has completed.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batches PK=3 and PK=5 remain]

Page 88: Spark Cassandra Connector Dataframes

If more batches are currently being executed by the Java Driver than concurrent.writes, we wait until one of the requests has completed.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batches PK=3, PK=5, and PK=8 being built]

Page 89: Spark Cassandra Connector Dataframes

If more batches are currently being executed by the Java Driver than concurrent.writes, we wait until one of the requests has completed.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batches PK=3, PK=5, and PK=8]

Page 90: Spark Cassandra Connector Dataframes

The last parameter, throughput_mb_per_sec, blocks further batches if we have written more than that much in the past second.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batches PK=3, PK=5, and PK=8 in flight]

Page 91: Spark Cassandra Connector Dataframes

The last parameter, throughput_mb_per_sec, blocks further batches if we have written more than that much in the past second.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: a write is acknowledged]

Page 92: Spark Cassandra Connector Dataframes

The last parameter, throughput_mb_per_sec, blocks further batches if we have written more than that much in the past second.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batches PK=3, PK=5, and PK=8 in flight]

Page 93: Spark Cassandra Connector Dataframes

The last parameter, throughput_mb_per_sec, blocks further batches if we have written more than that much in the past second.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: a write is acknowledged]

Page 94: Spark Cassandra Connector Dataframes

The last parameter, throughput_mb_per_sec, blocks further batches if we have written more than that much in the past second.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: further batches blocked until throughput drops below the limit]

Page 95: Spark Cassandra Connector Dataframes

The last parameter, throughput_mb_per_sec, blocks further batches if we have written more than that much in the past second.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: batches PK=3, PK=5, and PK=8 in flight]
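
To recap, a sketch of setting the write-path knobs used in this walkthrough (the values are the slide's demo values, not production recommendations):

val conf = new SparkConf()
  .set("spark.cassandra.output.batch.grouping.key", "partition")
  .set("spark.cassandra.output.batch.size.rows", "4")
  .set("spark.cassandra.output.batch.buffer.size", "3")
  .set("spark.cassandra.output.concurrent.writes", "2")
  .set("spark.cassandra.output.throughput_mb_per_sec", "5")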

Page 96: Spark Cassandra Connector Dataframes

Thanks for coming, and I hope you have a great time at C* Summit!

http://cassandrasummit-datastax.com/agenda/the-spark-cassandra-connector-past-present-and-future/

Also ask these guys really hard questions: Jacek, Piotr, Alex