65
@tlberglund and Spark Cassandra

Cassandra and Spark - Tim Berglund

Embed Size (px)

Citation preview

Page 1: Cassandra and Spark - Tim Berglund

@tlberglund

andSparkCassandra

Page 2: Cassandra and Spark - Tim Berglund

@tlberglund

andSparkCassandra

Page 3: Cassandra and Spark - Tim Berglund

Cassandra

Page 4: Cassandra and Spark - Tim Berglund

• You have too much data • It changes a lot • The system is always on

When?

Page 5: Cassandra and Spark - Tim Berglund

• Distributed database • Transactional/online • Low latency, high velocity

What :

Page 6: Cassandra and Spark - Tim Berglund

CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL(10,2), merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time DESC));

Data Model

Page 7: Cassandra and Spark - Tim Berglund

CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL(10,2), merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time DESC));

Partition Key

Page 8: Cassandra and Spark - Tim Berglund

CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL(10,2), merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time DESC));

Partition Key

Page 9: Cassandra and Spark - Tim Berglund

CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL(10,2), merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time DESC));

Clustering Column

Page 10: Cassandra and Spark - Tim Berglund

CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL(10,2), merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time DESC));

Clustering Column

Page 11: Cassandra and Spark - Tim Berglund

CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL(10,2), merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time DESC));

Primary Key

Page 12: Cassandra and Spark - Tim Berglund

Data Modelcard_number xctn_time amount country merchant

5444…8389

4532…1191

5444…6027

4532…2189

13:58

13:59

13:59

14:02

5444…6027 14:03

5444…8389 14:03

$12.42

$50.38

$3.02

$12.42

€23.78

$98.03

US

US

US

US

ES

US

101

101

102

198

352

158

Page 13: Cassandra and Spark - Tim Berglund

card_number xctn_time amount country merchant

5444…8389

4532…1191

5444…6027

4532…2189

13:58

13:59

13:59

14:02

5444…6027 14:03

5444…8389 14:03

$12.42

$50.38

$3.02

$12.42

€23.78

$98.03

US

US

US

US

ES

US

101

101

102

198

352

158

Partition Key

Page 14: Cassandra and Spark - Tim Berglund

card_number xctn_time amount country merchant

5444…8389

4532…1191

5444…6027

4532…2189

13:58

13:59

13:59

14:02

5444…6027 14:03

5444…8389 14:03

$12.42

$50.38

$3.02

$12.42

€23.78

$98.03

US

US

US

US

ES

US

101

101

102

198

352

158

Clustering Column

Page 15: Cassandra and Spark - Tim Berglund

card_number xctn_time amount country merchant

5444…8389

4532…1191

5444…6027

4532…2189

13:58

13:59

13:59

14:02

5444…6027 14:03

5444…8389 14:03

$12.42

$50.38

$3.02

$12.42

€23.78

$98.03

US

US

US

US

ES

US

101

101

102

198

352

158

Primary Key

Page 16: Cassandra and Spark - Tim Berglund

card_number xctn_time amount country merchant

5444…8389

4532…1191

5444…6027

4532…2189

13:58

13:59

13:59

14:02

5444…6027 14:03

5444…8389 14:03

$12.42

$50.38

$3.02

$12.42

€23.78

$98.03

US

US

US

US

ES

US

101

101

102

198

352

158Partition

Page 17: Cassandra and Spark - Tim Berglund

card_number xctn_time amount country merchant

5444…8389

4532…1191

5444…6027

4532…2189

13:58

13:59

13:59

14:02

5444…6027 14:03

5444…8389 14:03

$12.42

$50.38

$3.02

$12.42

€23.78

$98.03

US

US

US

US

ES

US

101

101

102

198

352

158

Partition

Page 18: Cassandra and Spark - Tim Berglund

card_number xctn_time amount country merchant

5444…8389

4532…1191

5444…6027

4532…2189

13:58

13:59

13:59

14:02

5444…6027 14:03

5444…8389 14:03

$12.42

$50.38

$3.02

$12.42

€23.78

$98.03

US

US

US

US

ES

US

101

101

102

198

352

158

Partition

Page 19: Cassandra and Spark - Tim Berglund

card_number xctn_time amount country merchant

5444…8389

4532…1191

5444…6027

4532…2189

13:58

13:59

13:59

14:02

5444…6027 14:03

5444…8389 14:03

$12.42

$50.38

$3.02

$12.42

€23.78

$98.03

US

US

US

US

ES

US

101

101

102

198

352

158

Partition

Page 20: Cassandra and Spark - Tim Berglund

2000

4000

6000

8000

A000

C000

E000

0000Partitioning

5444…838914:03 $98.03 US 158

f(x) 5D93

Page 21: Cassandra and Spark - Tim Berglund

2000

4000

6000

8000

A000

C000

E000

0000Partitioning

5444…838914:03 $98.03 US 158

5D93

Page 22: Cassandra and Spark - Tim Berglund

2000

4000

6000

8000

A000

C000

E000

0000Partitioning

5444…838914:03 $98.03 US 158

f(x) 5D93

5444…8389

Page 23: Cassandra and Spark - Tim Berglund

card_number xctn_time amount country merchant

5444…8389

4532…1191

5444…6027

4532…2189

13:58

13:59

13:59

14:02

5444…6027 14:03

5444…8389 14:03

$12.42

$50.38

$3.02

$12.42

€23.78

$98.03

US

US

US

US

ES

US

101

101

102

198

352

158

Partition

Partitioning

Page 24: Cassandra and Spark - Tim Berglund

• Clumps of related data • Always on one server • Accessed by consistent hashing

Partitions

Page 25: Cassandra and Spark - Tim Berglund

• Design for low-latency, high-throughput reads and writes

• Inflexible ad-hoc queries • Very little aggregation • Very little secondary indexing

Consequences

Page 26: Cassandra and Spark - Tim Berglund

Spark

Page 27: Cassandra and Spark - Tim Berglund

You have too much data You don’t know the questions You don’t like MapReduce

You’re tired of Hadoop lol

When do I use it?

Page 28: Cassandra and Spark - Tim Berglund

Distributed, functional “In-memory” “Real-time” Next-generation MapReduce

What are they saying?

Page 29: Cassandra and Spark - Tim Berglund

Paradigm :Store collections in Resilient Distributed Datasets, or “RDDs”

Transform RDDs into other RDDs

Page 30: Cassandra and Spark - Tim Berglund

Paradigm :Store collections in Resilient Distributed Datasets, or “RDDs”

Transform RDDs into other RDDs

Aggregate RDDs with “actions”

Page 31: Cassandra and Spark - Tim Berglund

Paradigm :Store collections in Resilient Distributed Datasets, or “RDDs”

Aggregate RDDs with “actions”

Transform RDDs into other RDDs

Keep track of a graph of transformations and

actions

Page 32: Cassandra and Spark - Tim Berglund

Paradigm :Store collections in Resilient Distributed Datasets, or “RDDs”

Keep track of a graph of transformations and

actionsTransform RDDs into

other RDDsAggregate RDDs with

“actions”Distribute all of this

storage and computation

Page 33: Cassandra and Spark - Tim Berglund

Paradigm :Store collections in Resilient Distributed Datasets, or “RDDs”

Distribute all of this storage and computation

Transform RDDs into other RDDs

Aggregate RDDs with “actions”

Keep track of a graph of transformations and

actions

Page 34: Cassandra and Spark - Tim Berglund

2000

4000

6000

8000

A000

C000

E000

0000Architecture

Spark Client Spark

Context

Driver

Page 35: Cassandra and Spark - Tim Berglund

Architecture

Spark Client Spark

ContextJob

Page 36: Cassandra and Spark - Tim Berglund

Architecture

Spark Client Spark

Context

Job

Cluster Manager

Page 37: Cassandra and Spark - Tim Berglund

Architecture

Spark Client Spark

Context

Cluster Manager

JobTaskTaskTaskTask

Page 38: Cassandra and Spark - Tim Berglund

Architecture

Spark Client Spark

Context

Cluster Manager

Job

TaskTask

Task

Task

Page 39: Cassandra and Spark - Tim Berglund

Worker NodeExecutor

Task Task

Cache

Page 40: Cassandra and Spark - Tim Berglund

Worker NodeExecutor

Task Task

Cache

Executor

Task Task

CacheRDD RDD

RDD

RDDRDD

RDD

Page 41: Cassandra and Spark - Tim Berglund

What’s an RDD?

Page 42: Cassandra and Spark - Tim Berglund

What’s an RDD?

This is

Page 43: Cassandra and Spark - Tim Berglund

Bigger than a computer

What’s an RDD?

Page 44: Cassandra and Spark - Tim Berglund

Read from an input source?

What’s an RDD?

Page 45: Cassandra and Spark - Tim Berglund

The output of a pure

function?

What’s an RDD?

Page 46: Cassandra and Spark - Tim Berglund

Immutable

What’s an RDD?

Page 47: Cassandra and Spark - Tim Berglund

Typed

What’s an RDD?

Page 48: Cassandra and Spark - Tim Berglund

Ordered

What’s an RDD?

Page 49: Cassandra and Spark - Tim Berglund

Functions comprise a

graph

What’s an RDD?

Page 50: Cassandra and Spark - Tim Berglund

Lazily Evaluated

What’s an RDD?

Page 51: Cassandra and Spark - Tim Berglund

Collection of Things

What’s an RDD?

Page 52: Cassandra and Spark - Tim Berglund

Distributed, how?

Page 53: Cassandra and Spark - Tim Berglund

Distributed, how?

5444 5676 0686 8389: List(xn-1, xn-2, xn-3)

4532 4569 7030 1191: List(xn-4, xn-5, xn-6)

5444 5607 6517 6027: List(xn-7, xn-8, xn-9)

4532 4577 0122 2189: List(xn-10, xn-11, xn-12)

Hash these

Page 54: Cassandra and Spark - Tim Berglund

val conf = new SparkConf(true) .set("spark.cassandra.connection.host", “10.0.3.74”)

val sc = new SparkContext(“spark://127.0.0.1:7077”, “music”, conf) val transactions = sc. // cassandra stuff .partitionBy(new HashPartitioner(25)).persist() } }

Page 55: Cassandra and Spark - Tim Berglund

ConnectorCassandra

Page 56: Cassandra and Spark - Tim Berglund

import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.SparkContext._ import com.datastax.spark.connector._

object ScalaProgram { def main(args: Array[String]) {

val conf = new SparkConf(true) .set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(“spark://127.0.0.1:7077”, “music”, conf) val rdd = sc.textFile(“/actual/path/raven.txt”) } }

From a Text File

Page 57: Cassandra and Spark - Tim Berglund

CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL(10,2), merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time DESC));

Data Model

Page 58: Cassandra and Spark - Tim Berglund

val conf = new SparkConf(true) .set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(“spark://127.0.0.1:7077”, “music”, conf) val xctns = sc.cassandraTable(“payments”, “credit_card_transactions")

From a Table

Page 59: Cassandra and Spark - Tim Berglund

val conf = new SparkConf(true) .set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(“spark://127.0.0.1:7077”, “music”, conf) val xctns = sc.cassandraTable(“payments”, “credit_card_transactions”) .select(“card_number”, “country”)

Projection

Page 60: Cassandra and Spark - Tim Berglund

val conf = new SparkConf(true) .set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(“spark://127.0.0.1:7077”, “music”, conf) val countries = xctns.map(x => (x.getString(“card_number”), x.getString(“country”))

Making k / v pairs

Page 61: Cassandra and Spark - Tim Berglund

val conf = new SparkConf(true) .set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(“spark://127.0.0.1:7077”, “music”, conf) val cCount = xctns.reduce(card, country => // cool analysis redacted! )

Count up Countries

Page 62: Cassandra and Spark - Tim Berglund

val conf = new SparkConf(true) .set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(“spark://127.0.0.1:7077”, “music”, conf) val countries = xctns.map(x => (x._1, x._2))

Make k / v pairs

Page 63: Cassandra and Spark - Tim Berglund

val conf = new SparkConf(true) .set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(“spark://127.0.0.1:7077”, “music”, conf) val fraud = cCount.saveToCassandra(“payments”, “frequent_countries”)

Persisting to a Table

Page 64: Cassandra and Spark - Tim Berglund
Page 65: Cassandra and Spark - Tim Berglund

Thank you!

@tlberglund