DESCRIPTION
Speaker: Rich Beaudoin, Senior Software Engineer at Pearson eCollege. In the world of Big Data it's crucial that your data is accessible. Cassandra provides us with a means to reliably store our data, but how can we keep it flowing? That's where Spark steps up to provide a powerful one-two punch with Cassandra to get your data flowing in all the right directions.
FEELIN' THE FLOW
GETTING YOUR DATA MOVING WITH SPARK AND CASSANDRA
Presented by Rich Beaudoin (@RichGBeaudoin) / October 14th, 2014
ABOUT ME...
Sr. Software Engineer at Pearson
Organizer of Distributed Computing Denver
Lover of Music
All around solid dude
OVERVIEW
What is Spark
  The problem it solves
  The core concepts
Spark integration with Cassandra
  Tables as RDDs
  Writing RDDs to Cassandra
Questions and Summary
WHAT IS SPARK?
Apache Spark™ is a fast and general engine for large-scale data processing.
Created by AMPLab at UC Berkeley
Became an Apache Top-Level Project in 2014
Supports Scala, Java, and Python APIs
THE PROBLEM, PART ONE...
Approaches like MapReduce read from, and store to, HDFS
...so each cycle of processing incurs latency from HDFS reads

THE PROBLEM, PART TWO...
Any robust, distributed data processing framework needs fault tolerance
But existing solutions allow for "fine-grained" (cell level) updates, which can complicate the handling of faults where data needs to be rebuilt/recalculated
SPARK ATTEMPTS TO ADDRESS THESE TWO PROBLEMS
Solution 1: store intermediate results in memory
Solution 2: introduce a new expressive data abstraction
RDD
A Resilient Distributed Dataset (RDD) is an immutable, partitioned collection of records that supports basic operations (e.g. map, filter, join). It maintains a graph of transformations in order to enable recovery of a lost partition.
*See the RDD white paper for more details
TRANSFORMATIONS AND ACTIONS
A "transformation" creates another RDD and is evaluated lazily
An "action" returns a value and is evaluated immediately
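The lazy/eager split can be sketched with plain Scala collections, no Spark required. This is an analogy only, not the Spark API: `LazyList` stands in for an RDD's deferred pipeline, and forcing it stands in for an action.

```scala
// Sketch only (plain Scala, not the Spark API): LazyList mimics how
// transformations build a deferred pipeline while an action forces it.
object LazyDemo {
  var evaluations = 0

  // "transformations" — like rdd.map(...).filter(...), nothing runs yet
  val pipeline = LazyList.from(1).take(10)
    .map { n => evaluations += 1; n * 2 }
    .filter(_ > 4)

  // "action" — like rdd.collect(), forces the whole pipeline
  def force(): List[Int] = pipeline.toList
}
```

Defining `pipeline` performs zero evaluations; only calling `force()` runs the mapped function over the ten elements, just as a Spark job only executes when an action is invoked.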
RDDS ARE EXPRESSIVE
It turns out that coarse-grained operations cover many existing parallel computing cases
Consequently, the RDD abstraction can implement existing systems like MapReduce, Pregel, Dryad, etc.
SPARK CLUSTER OVERVIEW
Spark can be run with Apache Mesos, Hadoop YARN, or its own standalone cluster manager
JOB SCHEDULING AND STAGES
SPARK AND CASSANDRA
If we can turn Cassandra data into RDDs, and RDDs into Cassandra data, then the data can start flowing between the two systems and give us some insight into our data.
The Spark Cassandra Connector allows us to perform the transformation from Cassandra table to RDD and then back again!
THE SETUP
FROM CASSANDRA TABLE TO RDD

import org.apache.spark._
import com.datastax.spark.connector._

val rdd = sc.cassandraTable("music", "albums_by_artist")

Run these commands in spark-shell; requires specifying the spark-connector jar on the command line
SIMPLE MAPREDUCE FOR RDD COLUMN COUNT

val count = rdd.map(x => (x.get[String]("label"), 1)).reduceByKey(_ + _)
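The map-then-reduceByKey pattern above can be sketched with plain Scala collections (no Spark or Cassandra needed). The album rows below are hypothetical stand-ins for the `albums_by_artist` table, and `groupMapReduce` plays the role of `reduceByKey`:

```scala
// Sketch of the map -> reduceByKey pattern on plain Scala collections.
// The rows are made-up stand-ins for the Cassandra table.
object LabelCountDemo {
  val albums = Seq(
    ("Miles Davis", "Kind of Blue", "Columbia"),
    ("Bob Dylan", "Highway 61 Revisited", "Columbia"),
    ("John Coltrane", "A Love Supreme", "Impulse!")
  )

  // map each row to (label, 1), then sum the 1s per label
  val counts: Map[String, Int] = albums
    .map { case (_, _, label) => (label, 1) }
    .groupMapReduce(_._1)(_._2)(_ + _)
}
```

Here `counts` ends up as `Map("Columbia" -> 2, "Impulse!" -> 1)`; Spark's `reduceByKey` does the same per-key aggregation, but distributed across partitions.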
SAVE THE RDD TO CASSANDRA

count.saveToCassandra("music", "label_count", SomeColumns("label", "count"))
CASSANDRA WITH SPARK SQL

import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)
val rdd = cc.sql("SELECT * FROM music.label_count")
JOINS!!!

import sqlContext.createSchemaRDD
import org.apache.spark.sql._

case class LabelCount(label: String, count: Int)
case class AlbumArtist(artist: String, album: String, label: String, year: Int)
case class AlbumArtistCount(artist: String, album: String, label: String, year: Int, count: Int)

val albumArtists = sc.cassandraTable[AlbumArtist]("music", "albums_by_artists").cache
val labelCounts = sc.cassandraTable[LabelCount]("music", "label_count").cache

val albumsByLabelId = albumArtists.keyBy(x => x.label)
val countsByLabelId = labelCounts.keyBy(x => x.label)

val joinedAlbums = albumsByLabelId.join(countsByLabelId).cache
val albumArtistCountObjects = joinedAlbums.map(x =>
  new AlbumArtistCount(x._2._1.artist, x._2._1.album, x._2._1.label, x._2._1.year, x._2._2.count))
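The keyBy-then-join step can also be sketched with plain Scala collections. This is an illustration of the join semantics, not the Spark API: each side is keyed by `label`, and the join keeps only keys present on both sides, pairing up their values.

```scala
// Sketch of keyBy + join semantics on plain Scala collections.
// The rows are hypothetical stand-ins for the two Cassandra tables.
object JoinDemo {
  case class LabelCount(label: String, count: Int)
  case class AlbumArtist(artist: String, album: String, label: String)

  val albums = Seq(AlbumArtist("Miles Davis", "Kind of Blue", "Columbia"))
  val counts = Seq(LabelCount("Columbia", 2), LabelCount("Impulse!", 1))

  // keyBy: turn each side into (key, value) pairs
  val albumsByLabel = albums.map(a => a.label -> a).toMap
  val countsByLabel = counts.map(c => c.label -> c).toMap

  // join: keep only keys present on both sides, pairing the values
  val joined: Map[String, (AlbumArtist, LabelCount)] =
    albumsByLabel.flatMap { case (label, album) =>
      countsByLabel.get(label).map(count => label -> (album, count))
    }
}
```

Only "Columbia" appears in `joined`, because "Impulse!" has no matching album on the left side; Spark's RDD `join` behaves the same way (an inner join), just shuffled across the cluster by key.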
OTHER THINGS TO CHECK OUT
Spark Streaming
Spark SQL
QUESTIONS?
THE END
References:
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Spark Programming Guide
Apache Spark Website
DataStax Spark Cassandra Connector Documentation