DESCRIPTION
Speaker: Rich Beaudoin, Senior Software Engineer at Pearson eCollege. In the world of Big Data it's crucial that your data is accessible. Cassandra provides us with a means to reliably store our data, but how can we keep it flowing? That's where Spark steps up to provide a powerful one-two punch with Cassandra to get your data flowing in all the right directions.
FEELIN' THE FLOW
GETTING YOUR DATA MOVING WITH SPARK AND CASSANDRA
Presented by Rich Beaudoin (@RichGBeaudoin) / October 14th, 2014
ABOUT ME...
Sr. Software Engineer at Pearson
Organizer of Distributed Computing Denver
Lover of Music
All around solid dude
OVERVIEW
What is Spark
  The problem it solves
  The core concepts
Spark integration with Cassandra
  Tables as RDDs
  Writing RDDs to Cassandra
Questions and Summary
WHAT IS SPARK?
Apache Spark™ is a fast and general engine for large-scale data processing.
Created by AMPLab at UC Berkeley
Became an Apache Top-Level Project in 2014
Supports Scala, Java, and Python APIs
THE PROBLEM, PART ONE...
Approaches like MapReduce read from, and store to, HDFS
...so each cycle of processing incurs latency from HDFS reads

THE PROBLEM, PART TWO...
Any robust, distributed data processing framework needs fault tolerance
But existing solutions allow for "fine-grained" (cell level) updates, which can complicate the handling of faults where data needs to be rebuilt/recalculated
SPARK ATTEMPTS TO ADDRESS THESE TWO PROBLEMS
Solution 1: store intermediate results in memory
Solution 2: introduce a new expressive data abstraction
RDD
A Resilient Distributed Dataset (RDD) is an immutable, partitioned collection of records that supports basic operations (e.g. map, filter, join). It maintains a graph of transformations in order to enable recovery of a lost partition.
*See the RDD white paper for more details
TRANSFORMATIONS AND ACTIONS
A "transformation" creates another RDD and is evaluated lazily
An "action" returns a value and is evaluated immediately
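The lazy/eager split can be sketched with plain Scala collections, no Spark required. This is an analogy only, not the Spark API: `LazyList` stands in for an RDD's deferred pipeline, and forcing it stands in for an action.

```scala
// Sketch only (plain Scala, not the Spark API): LazyList mimics how
// transformations build a deferred pipeline while an action forces it.
object LazyDemo {
  var evaluations = 0

  // "transformations" — like rdd.map(...).filter(...), nothing runs yet
  val pipeline = LazyList.from(1).take(10)
    .map { n => evaluations += 1; n * 2 }
    .filter(_ > 4)

  // "action" — like rdd.collect(), forces the whole pipeline
  def force(): List[Int] = pipeline.toList
}
```

Defining `pipeline` performs zero evaluations; only calling `force()` runs the mapped function over the ten elements, just as a Spark job only executes when an action is invoked.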
RDDS ARE EXPRESSIVE
It turns out that coarse-grained operations cover many existing parallel computing cases
Consequently, the RDD abstraction can implement existing systems like MapReduce, Pregel, Dryad, etc.
SPARK CLUSTER OVERVIEW
Spark can be run with Apache Mesos, Hadoop YARN, or its own standalone cluster manager
JOB SCHEDULING AND STAGES
SPARK AND CASSANDRA
If we can turn Cassandra data into RDDs, and RDDs into Cassandra data, then the data can start flowing between the two systems and give us some insight into our data.
The Spark Cassandra Connector allows us to perform the transformation from Cassandra table to RDD and then back again!
THE SETUP
FROM CASSANDRA TABLE TO RDD

import org.apache.spark._
import com.datastax.spark.connector._

val rdd = sc.cassandraTable("music", "albums_by_artist")

Run these commands in spark-shell; requires specifying the spark-connector jar on the command line
SIMPLE MAPREDUCE FOR RDD COLUMN COUNT

val count = rdd.map(x => (x.get[String]("label"), 1)).reduceByKey(_ + _)
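The map-then-reduceByKey pattern above can be sketched with plain Scala collections (no Spark or Cassandra needed). The album rows below are hypothetical stand-ins for the `albums_by_artist` table, and `groupMapReduce` plays the role of `reduceByKey`:

```scala
// Sketch of the map -> reduceByKey pattern on plain Scala collections.
// The rows are made-up stand-ins for the Cassandra table.
object LabelCountDemo {
  val albums = Seq(
    ("Miles Davis", "Kind of Blue", "Columbia"),
    ("Bob Dylan", "Highway 61 Revisited", "Columbia"),
    ("John Coltrane", "A Love Supreme", "Impulse!")
  )

  // map each row to (label, 1), then sum the 1s per label
  val counts: Map[String, Int] = albums
    .map { case (_, _, label) => (label, 1) }
    .groupMapReduce(_._1)(_._2)(_ + _)
}
```

Here `counts` ends up as `Map("Columbia" -> 2, "Impulse!" -> 1)`; Spark's `reduceByKey` does the same per-key aggregation, but distributed across partitions.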
SAVE THE RDD TO CASSANDRA

count.saveToCassandra("music", "label_count", SomeColumns("label", "count"))
CASSANDRA WITH SPARK SQL

import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)
val rdd = cc.sql("SELECT * FROM music.label_count")
JOINS!!!

import sqlContext.createSchemaRDD
import org.apache.spark.sql._

case class LabelCount(label: String, count: Int)
case class AlbumArtist(artist: String, album: String, label: String, year: Int)
case class AlbumArtistCount(artist: String, album: String, label: String, year: Int, count: Int)

val albumArtists = sc.cassandraTable[AlbumArtist]("music", "albums_by_artists").cache
val labelCounts = sc.cassandraTable[LabelCount]("music", "label_count").cache

val albumsByLabelId = albumArtists.keyBy(x => x.label)
val countsByLabelId = labelCounts.keyBy(x => x.label)

val joinedAlbums = albumsByLabelId.join(countsByLabelId).cache
val albumArtistCountObjects = joinedAlbums.map(x =>
  new AlbumArtistCount(x._2._1.artist, x._2._1.album, x._2._1.label, x._2._1.year, x._2._2.count))
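The keyBy-then-join step can also be sketched with plain Scala collections. This is an illustration of the join semantics, not the Spark API: each side is keyed by `label`, and the join keeps only keys present on both sides, pairing up their values.

```scala
// Sketch of keyBy + join semantics on plain Scala collections.
// The rows are hypothetical stand-ins for the two Cassandra tables.
object JoinDemo {
  case class LabelCount(label: String, count: Int)
  case class AlbumArtist(artist: String, album: String, label: String)

  val albums = Seq(AlbumArtist("Miles Davis", "Kind of Blue", "Columbia"))
  val counts = Seq(LabelCount("Columbia", 2), LabelCount("Impulse!", 1))

  // keyBy: turn each side into (key, value) pairs
  val albumsByLabel = albums.map(a => a.label -> a).toMap
  val countsByLabel = counts.map(c => c.label -> c).toMap

  // join: keep only keys present on both sides, pairing the values
  val joined: Map[String, (AlbumArtist, LabelCount)] =
    albumsByLabel.flatMap { case (label, album) =>
      countsByLabel.get(label).map(count => label -> (album, count))
    }
}
```

Only "Columbia" appears in `joined`, because "Impulse!" has no matching album on the left side; Spark's RDD `join` behaves the same way (an inner join), just shuffled across the cluster by key.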
OTHER THINGS TO CHECK OUT
Spark Streaming
Spark SQL
QUESTIONS?
THE END
References:
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Spark Programming Guide
Apache Spark Website
DataStax Spark Cassandra Connector Documentation