Spark + Cassandra
Carl Yeksigian
DataStax
Spark
-Fast large-scale data processing framework
-Focused on in-memory workloads
-Supports Java, Scala, and Python
-Integrated machine learning support (MLlib)
-Streaming support
-Simple developer API
Resilient Distributed Dataset (RDD)
-Presents a simple Collection API to the
developer
-Breaks full collection into partitions, which can
be operated on independently
-Knows how to recalculate itself if data is lost
-Separates how a job is completed from the individual tasks that run it
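The partitioning idea above can be sketched with plain Scala collections, without Spark. This is only an illustration under stated assumptions: `PartitionSketch`, `partition`, and `sumOfSquares` are hypothetical names, and a real RDD also handles recomputation of lost data and data locality, which this sketch leaves out.

```scala
// Illustration only: plain Scala collections stand in for an RDD.
// PartitionSketch, partition, and sumOfSquares are hypothetical names, not Spark API.
object PartitionSketch {
  // Split a collection into at most `n` roughly equal partitions.
  def partition[A](data: Vector[A], n: Int): Vector[Vector[A]] = {
    require(n > 0, "need at least one partition")
    val size = math.max(1, math.ceil(data.size.toDouble / n).toInt)
    data.grouped(size).toVector
  }

  // Each partition is processed on its own; the partial results are then combined.
  def sumOfSquares(data: Vector[Int], n: Int): Long =
    partition(data, n)
      .map(part => part.map(x => x.toLong * x).sum) // per-partition work
      .sum                                          // combine step
}
```

Because no partition depends on another, the per-partition step could in principle run on different machines, which is the property Spark relies on.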
RDD
RDD API
Partitions
-Partitions can be created so they are on the
same machine as the data
Uses for Spark with Cassandra
-Ad-hoc queries
-Joins, Unions across tables
-Rewriting tables
-Machine Learning
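CQL itself has no join operation, which is why joins and unions are listed as a use for Spark. The sketch below shows the two operations on in-memory collections; it is a stand-in only. `JoinSketch`, `Sample`, `Call`, and the `cohort` field are hypothetical and do not come from the talk, and the join assumes each `sampleid` appears at most once in `samples`.

```scala
// Illustration only: Seqs stand in for two keyed tables.
// Sample, Call, and the cohort field are hypothetical.
object JoinSketch {
  case class Sample(sampleid: String, cohort: String)
  case class Call(sampleid: String, allele: String)

  // Inner join on sampleid: pair every call with its sample's cohort.
  // Assumes sampleid is unique within `samples`.
  def join(samples: Seq[Sample], calls: Seq[Call]): Seq[(String, String, String)] = {
    val bySample = samples.map(s => s.sampleid -> s.cohort).toMap
    for {
      c      <- calls
      cohort <- bySample.get(c.sampleid).toSeq // calls without a matching sample are dropped
    } yield (c.sampleid, cohort, c.allele)
  }

  // Union: concatenate rows from two tables with the same shape.
  def union(a: Seq[Call], b: Seq[Call]): Seq[Call] = a ++ b
}
```

On RDDs the corresponding operations are `join` (on key-value pairs) and `union`.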
spark-cassandra-connector
DataStax OSS Project
https://github.com/datastax/spark-cassandra-connector
Spark Cassandra Connector
-Exposes Cassandra tables as RDDs
-Read from and write to Cassandra
-Data type mapping
-Scala and Java support
Spark + Bioinformatics
-ADAM is a bioinformatics project out of UC
Berkeley AMPLab
-Combines Spark + Parquet + Avro
https://github.com/bigdatagenomics/adam
http://bdgenomics.org/
Simple Variant
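case class Variant (
  sampleid: String,
  referencename: String,
  location: Long,
  allele: String)
create table adam.variants (
  sampleid ascii,
  referencename ascii,
  location bigint,
  allele ascii,
  -- CQL requires a primary key; this choice is an assumption, not from the original slide
  primary key (sampleid, referencename, location, allele))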
Connecting to Cassandra
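import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
// Spark connection options
// Note: 345 is not a valid IPv4 octet; substitute the address of your own cluster
val conf = new SparkConf(true)
  .setMaster("spark://192.168.345.10:7077")
  .setAppName("cassandra-demo")
  // Later connector releases name this property "spark.cassandra.connection.host"
  .set("cassandra.connection.host", "192.168.345.10")
val sc = new SparkContext(conf)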
Saving To Cassandra
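import org.apache.spark.rdd.RDD
// Load variants from the VCF file named by the first program argument (ADAM)
val variants: RDD[VariantContext] = sc.adamVCFLoad(args(0))
// getVariant (not shown on the slide) turns each VariantContext into Variant rows
variants.flatMap(getVariant)
  .saveToCassandra("adam", "variants", AllColumns)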
Querying Cassandra
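// Count variants per allele, then sort by count, highest first
val rdd = sc.cassandraTable("adam", "variants")
  .map(r => (r.get[String]("allele"), 1L))
  .reduceByKey(_ + _)
  .map(r => (r._2, r._1)) // swap to (count, allele) so the count is the sort key
  .sortByKey(ascending = false)
// Print each allele followed by its count
rdd.collect()
  .foreach(bc => println("%40s\t%d".format(bc._2, bc._1)))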
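To see what the query above computes without a cluster, the same aggregation can be sketched on an in-memory collection. This is a stand-in under stated assumptions: `AlleleCounts`, `countsByAllele`, and `render` are hypothetical names, a `Seq` replaces the Cassandra-backed RDD, and ties are broken by allele purely to make the order deterministic (the original pipeline does not specify tie order).

```scala
// Illustration only: a Seq stands in for the Cassandra-backed RDD.
// AlleleCounts, countsByAllele, and render are hypothetical names.
object AlleleCounts {
  case class Variant(sampleid: String, referencename: String, location: Long, allele: String)

  // Count variants per allele and order by count, largest first.
  def countsByAllele(variants: Seq[Variant]): Seq[(Long, String)] =
    variants
      .groupBy(_.allele) // plays the role of map + reduceByKey
      .toSeq             // leave Map before re-keying, so equal counts are not merged
      .map { case (allele, vs) => (vs.size.toLong, allele) }
      .sortBy { case (count, allele) => (-count, allele) } // ties broken by allele

  // Same layout as the println above: allele right-aligned in 40 columns, tab, count.
  def render(rows: Seq[(Long, String)]): Seq[String] =
    rows.map { case (count, allele) => "%40s\t%d".format(allele, count) }
}
```

Calling `AlleleCounts.render(AlleleCounts.countsByAllele(rows)).foreach(println)` prints one line per allele, most frequent first.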
Thanks
Acknowledgements:
Timothy Danford (AMPLab)
Matt Massie (AMPLab)
Frank Nothaft (AMPLab)
Jeff Hammerbacher (Cloudera/Mt Sinai)