Spark + Cassandra

Carl Yeksigian

DataStax

Spark

-Fast large-scale data processing framework

-Focused on in-memory workloads

-Supports Java, Scala, and Python

-Integrated machine learning support (MLlib)

-Streaming support

-Simple developer API

Resilient Distributed Dataset (RDD)

-Presents a simple Collection API to the developer

-Breaks the full collection into partitions, which can be operated on independently

-Knows how to recalculate itself if data is lost

-Abstracts how to complete a job from the tasks
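The "simple Collection API" point can be illustrated with plain Scala collections; an RDD exposes the same functional map/group/reduce style, just evaluated lazily across a cluster. A local sketch (ordinary collections, not Spark code):

```scala
// Plain Scala collections showing the same API shape an RDD presents.
// With an RDD these transformations would run lazily, partition by partition.
val reads = List("ACGT", "ACGA", "TTGA", "ACGT")

// map + groupBy + sum, as you would write map + reduceByKey against an RDD
val counts = reads
  .map(r => (r, 1))
  .groupBy { case (read, _) => read }
  .map { case (read, pairs) => (read, pairs.map(_._2).sum) }

// counts contains ("ACGT" -> 2), ("ACGA" -> 1), ("TTGA" -> 1)
```

The payoff of the shared API is that code prototyped on a local collection translates almost line-for-line to a distributed RDD.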

RDD

RDD API

Partitions

-Partitions can be created so they are on the same machine as the data
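One way to picture partitioning: assign each element to a chunk by hashing its key, then process every chunk independently and combine the partial results. A local illustration of the idea (Spark's real partitioners and locality logic live in the framework):

```scala
// Hash-partition keys into numPartitions chunks, process each chunk
// on its own, then combine -- the shape of work an RDD distributes.
val numPartitions = 4
val keys = (1 to 100).toList

// Assign each key to a partition by hash, as a hash partitioner would
val partitions: Map[Int, List[Int]] =
  keys.groupBy(k => math.abs(k.hashCode) % numPartitions)

// Each partition is summed independently (in Spark: one task per partition),
// then the partial sums are combined
val partialSums = partitions.values.map(_.sum)
val total = partialSums.sum  // same result as summing the whole list: 5050
```

Because each chunk only depends on its own elements, a lost partition can be recomputed from its inputs alone, which is the basis of the resilience described above.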

Uses for Spark with Cassandra

-Ad-hoc queries

-Joins, Unions across tables

-Rewriting tables

-Machine Learning

spark-cassandra-connector

DataStax OSS Project
https://github.com/datastax/spark-cassandra-connector

Spark Cassandra Connector

-Exposes Cassandra tables as RDDs

-Read from and write to Cassandra

-Data type mapping

-Scala and Java support

Spark + Bioinformatics

-ADAM is a bioinformatics project out of UC Berkeley AMPLab

-Combines Spark + Parquet + Avro

https://github.com/bigdatagenomics/adam

http://bdgenomics.org/

Simple Variant

case class Variant(
  sampleid: String,
  referencename: String,
  location: Long,
  allele: String)

create table adam.variants (
  sampleid ascii,
  referencename ascii,
  location bigint,
  allele ascii,
  -- a primary key is required in CQL; this composite key is one plausible choice
  primary key ((sampleid, referencename), location, allele))

Connecting to Cassandra

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.123.10")

val sc = new SparkContext(conf)

Saving To Cassandra

val variants: RDD[VariantContext] = sc.adamVCFLoad(args(0))

variants.flatMap(getVariant)
  .saveToCassandra("adam", "variants", AllColumns)

Querying Cassandra

val rdd = sc.cassandraTable("adam", "variants")
  .map(r => (r.get[String]("allele"), 1L))
  .reduceByKey(_ + _)
  .map(r => (r._2, r._1))
  .sortByKey(ascending = false)

rdd.collect()
  .foreach(bc => println("%40s\t%d".format(bc._2, bc._1)))
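The query above is the classic count-and-rank pattern. The same computation on plain Scala collections, using an in-memory stand-in for the alleles column (a local sketch of what the RDD pipeline computes, not connector code):

```scala
// In-memory stand-in for the allele column of adam.variants
val alleles = List("A", "C", "A", "T", "A", "C")

val ranked = alleles
  .map(a => (a, 1L))                              // (allele, 1), as in the RDD version
  .groupBy { case (a, _) => a }                   // local stand-in for reduceByKey
  .toList
  .map { case (a, ps) => (ps.map(_._2).sum, a) }  // swap to (count, allele)
  .sortBy { case (count, _) => -count }           // descending, like sortByKey(false)

// ranked: List((3,"A"), (2,"C"), (1,"T"))
```

The (allele, 1) → reduce → swap → sort shape is the same word-count idiom the slide's RDD pipeline uses, only distributed.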

Thanks

Acknowledgements:

Timothy Danford (AMPLab)

Matt Massie (AMPLab)

Frank Nothaft (AMPLab)

Jeff Hammerbacher (Cloudera/Mt Sinai)