Video of the talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM Apache Spark is a next-generation engine for large-scale data processing, built with Scala. This talk first shows how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how this abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib, and GraphX). The Spark source is referenced to illustrate how these concepts are implemented in Scala. http://www.meetup.com/Scala-Bay/events/209740892/
Apache Spark RDDs
Dean Chen, eBay Inc.
http://spark-summit.org/wp-content/uploads/2014/07/Sparks-Role-in-the-Big-Data-Ecosystem-Matei-Zaharia1.pdf
Spark
• 2010 paper from Berkeley's AMPLab
• Resilient distributed datasets (RDDs)
• Generalized distributed computation engine/platform
• Fault-tolerant in-memory caching
• Extensible interface for various workloads
http://spark-summit.org/wp-content/uploads/2014/07/Sparks-Role-in-the-Big-Data-Ecosystem-Matei-Zaharia1.pdf
https://amplab.cs.berkeley.edu/software/
RDDs
• Resilient distributed datasets
• "read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost"
• Familiar Scala collections API for distributed data and computation
• Monadic expression of lazy transformations, but not monads
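The "can be rebuilt if a partition is lost" idea can be sketched without Spark at all. The following is a minimal illustration in plain Scala collections, not Spark's implementation; the names `LineageSketch`, `source`, `transform`, and `rebuild` are hypothetical and chosen only for this example.

```scala
object LineageSketch {
  // Hypothetical read-only source data, already split into two partitions.
  val source: Vector[Vector[Int]] = Vector(Vector(1, 2, 3), Vector(4, 5, 6))

  // The recorded transformation (the "lineage"): how a partition is derived.
  val transform: Int => Int = _ * 10

  // Derive one partition on demand from the read-only source.
  def rebuild(partition: Int): Vector[Int] = source(partition).map(transform)

  // Materialized (cached) partitions.
  val cached: Map[Int, Vector[Int]] = Map(0 -> rebuild(0), 1 -> rebuild(1))

  // Simulate losing partition 1, then recover it by recomputing from lineage.
  val afterLoss: Map[Int, Vector[Int]] = cached - 1
  val recovered: Map[Int, Vector[Int]] = afterLoss + (1 -> rebuild(1))
}
```

Because the source is never mutated and the transformation is recorded, only the lost partition has to be recomputed; nothing needs to be replicated up front.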
Spark Shell
• Interactive queries and prototyping
• Local, YARN, Mesos
• Static type checking and autocompletion
• Lambdas
val titles = sc.textFile("titles.txt")
val countsRdd = titles
  .flatMap(tokenize)
  .map(word => (cleanse(word), 1))
  .reduceByKey(_ + _)
val counts = countsRdd
  .filter { case (_, total) => total > 10000 }
  .sortBy { case (_, total) => total }
  .filter { case (word, _) => word.length >= 5 }
  .collect
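The word count above relies on `tokenize` and `cleanse`, which the slides do not define. As a rough local sketch, assuming simple whitespace splitting and lowercasing (both helper bodies are guesses, not the speaker's code), the same pipeline can be tried on an in-memory `Seq` in place of `sc.textFile`, with `groupBy` standing in for `reduceByKey` and a much smaller threshold:

```scala
object WordCountSketch {
  // Hypothetical helpers; the talk does not show their definitions.
  def tokenize(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).toSeq
  def cleanse(word: String): String =
    word.toLowerCase.filter(_.isLetterOrDigit)

  // Stand-in for sc.textFile("titles.txt"): a few in-memory lines.
  val titles: Seq[String] =
    Seq("Learning Scala", "Scala and Spark", "Spark: the engine")

  // Local equivalent of flatMap + map + reduceByKey.
  val counts: Map[String, Int] =
    titles
      .flatMap(tokenize)
      .map(word => (cleanse(word), 1))
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  // Local equivalent of the filter / sortBy / filter chain, with a small threshold.
  val frequent: Seq[(String, Int)] =
    counts.toSeq
      .filter { case (_, total) => total >= 2 }
      .sortBy { case (word, total) => (total, word) }
      .filter { case (word, _) => word.length >= 5 }
}
```

The shape of the code is nearly identical to the Spark version, which is the point of the "familiar Scala collections API" bullet; the difference is that here everything runs eagerly on one machine.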
Transformations
map filter flatMap sample union intersection
distinct groupByKey reduceByKey sortByKey join cogroup cartesian
Actions
reduce collect count first
take takeSample saveAsTextFile foreach
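The split between transformations and actions matters because transformations only describe work, while actions trigger it. A small sketch of that behavior using a plain Scala `Iterator` (an analogy, not Spark itself; `LazySketch` and its members are hypothetical names):

```scala
object LazySketch {
  // Counts how many elements the map function has actually processed.
  var evaluated = 0

  // "Transformations": building the pipeline does no work,
  // because Iterator.map and Iterator.filter are lazy.
  def pipeline(): Iterator[Int] =
    Iterator(1, 2, 3, 4, 5)
      .map { x => evaluated += 1; x * 2 }
      .filter(_ > 4)

  val pending: Iterator[Int] = pipeline()
  val evaluatedBeforeAction: Int = evaluated // nothing has been computed yet

  // "Action": consuming the iterator forces the whole chain to run.
  val result: List[Int] = pending.toList
  val evaluatedAfterAction: Int = evaluated // every element has passed through map
}
```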
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Count(word: String, total: Int)
val schemaRdd = countsRdd.map(c => Count(c._1, c._2))
val count = schemaRdd
  .where('word === "scala")
  .select('total)
  .collect
schemaRdd.registerTempTable("counts")
sql("SELECT total FROM counts WHERE word = 'scala'").collect
schemaRdd
  .filter(_.word == "scala")
  .map(_.total)
  .collect
registerFunction("LEN", (_: String).length)
val queryRdd = sql("SELECT * FROM counts WHERE LEN(word) = 10 ORDER BY total DESC LIMIT 10")
queryRdd
  .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
  .collect()
  .foreach(println)
Spark Streaming
• Real-time computation similar to Storm
• Input distributed in memory for fault tolerance
• Streams input into sliding windows of RDDs
• Kafka, Flume, Kinesis, HDFS
TwitterUtils.createStream()
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))
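To make the sliding-window idea concrete, here is a rough sketch in plain Scala where each inner `Seq` plays the role of one micro-batch. It is only an analogy for windowed counting, not the Spark Streaming API; the batch contents and the window size of two batches are assumptions made for the example.

```scala
object WindowSketch {
  // Hypothetical micro-batches, one per batch interval.
  val batches: Seq[Seq[String]] = Seq(
    Seq("Spark is fast", "hello"),
    Seq("learning Scala"),
    Seq("Spark Streaming", "Spark SQL"),
    Seq("GraphX")
  )

  // Per-batch count of texts mentioning "Spark".
  val perBatch: Seq[Int] = batches.map(_.count(_.contains("Spark")))

  // Window of 2 batches sliding by 1 batch: sum the counts in each window.
  val windowed: List[Int] = perBatch.sliding(2, 1).map(_.sum).toList
}
```

Each window overlaps the previous one, so a batch contributes to more than one windowed count.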
GraphX
• Optimally partitions and indexes vertices and edges represented as RDDs
• APIs to join and traverse graphs
• PageRank, connected components, triangle counting
val graph = Graph(userIdRDD, assocRDD)
val ranks = graph.pageRank(0.0001).vertices
val userRDD = sc.textFile("graphx/data/users.txt")
val users = userRDD.map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
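For intuition about what a PageRank computation does, here is a small sketch on local collections. It is not GraphX code: the edge list is made up, the 0.15/0.85 constants are the conventional damping choice assumed here, and a fixed iteration count replaces the tolerance-based stopping rule used in the call above.

```scala
object PageRankSketch {
  // Hypothetical directed edges (source -> destination) between user ids.
  val edges: Seq[(Long, Long)] = Seq((1L, 2L), (2L, 3L), (3L, 1L), (1L, 3L))
  val vertices: Seq[Long] = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
  val outDegree: Map[Long, Int] =
    edges.groupBy(_._1).map { case (v, out) => (v, out.size) }

  // One iteration: each vertex splits its rank evenly among its out-edges,
  // then every vertex takes 0.15 plus 0.85 times what it received.
  def step(ranks: Map[Long, Double]): Map[Long, Double] = {
    val received: Map[Long, Double] = edges
      .map { case (s, d) => (d, ranks(s) / outDegree(s)) }
      .groupBy(_._1)
      .map { case (v, cs) => (v, cs.map(_._2).sum) }
    vertices.map(v => (v, 0.15 + 0.85 * received.getOrElse(v, 0.0))).toMap
  }

  // Start every vertex at rank 1.0 and apply a fixed number of iterations.
  val initial: Map[Long, Double] = vertices.map(v => (v, 1.0)).toMap
  val ranks: Map[Long, Double] = Iterator.iterate(initial)(step).drop(100).next()
}
```

In this toy graph, vertex 3 receives links from both other vertices and ends up with the highest rank.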
MLlib
• Machine learning library similar to Mahout
• Statistics, regression, decision trees, clustering, PCA, gradient descent
• Iterative algorithms run much faster thanks to in-memory caching
val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble))
  )
}
val model = LinearRegressionWithSGD.train(parsedData, 100)
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds
  .map { case (v, p) => math.pow(v - p, 2) }
  .mean()
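The final step computes the mean squared error: the average of the squared differences between each label and its prediction. A tiny local sketch with made-up numbers (the pairs below are hypothetical stand-ins for `valuesAndPreds`, not model output):

```scala
object MseSketch {
  // Hypothetical (label, prediction) pairs standing in for valuesAndPreds.
  val valuesAndPreds: Seq[(Double, Double)] =
    Seq((1.0, 1.5), (2.0, 2.0), (3.0, 2.0))

  // Mean of the squared differences between label and prediction.
  val mse: Double =
    valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.sum / valuesAndPreds.size
}
```

Here the squared errors are 0.25, 0 and 1, so the result is 1.25 divided by 3.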
RDDs
• Resilient distributed datasets
• Familiar Scala collections API
• Distributed data and computation
• Monadic expression of transformations
• But not monads
Pseudo Monad
• Wraps an iterator plus the distribution of partitions
• Keeps track of history for fault tolerance
• Lazily evaluated, chaining of expressions
https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf
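One way to picture the "pseudo monad" shape is a wrapper that defers an iterator and records the operations applied to it. The sketch below is a hypothetical illustration in plain Scala (`Chain`, `fromSeq`, and `history` are invented names), not Spark's RDD class. Note that `flatMap` here takes a function returning a plain collection rather than another `Chain`, which is one plausible reading of "monadic, but not monads".

```scala
object PseudoMonadSketch {
  // Hypothetical wrapper: a deferred iterator plus the operations that produced it.
  final class Chain[A](val history: List[String], compute: () => Iterator[A]) {
    def map[B](f: A => B): Chain[B] =
      new Chain(history :+ "map", () => compute().map(f))
    def filter(p: A => Boolean): Chain[A] =
      new Chain(history :+ "filter", () => compute().filter(p))
    // Takes a function to a plain collection, not to another Chain.
    def flatMap[B](f: A => Iterable[B]): Chain[B] =
      new Chain(history :+ "flatMap", () => compute().flatMap(f))
    // Nothing runs until an "action" such as collect is called.
    def collect(): List[A] = compute().toList
  }

  def fromSeq[A](data: Seq[A]): Chain[A] =
    new Chain(List("source"), () => data.iterator)

  val chain: Chain[Int] =
    fromSeq(Seq("a b", "c")).flatMap(_.split(" ").toList).map(_.length).filter(_ > 0)
}
```

Calling `PseudoMonadSketch.chain.collect()` runs the chain, and `chain.history` shows the recorded steps that could be replayed after a failure.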
RDD Interface
• compute: transformation applied to iterable(s)
• getPartitions: partition data for parallel computation
• getDependencies: lineage of parent RDDs and whether a shuffle is required
HadoopRDD
• compute: read HDFS block or file split
• getPartitions: HDFS block or file split
• getDependencies: None
MappedRDD
• compute: compute parent and map result
• getPartitions: parent partition
• getDependencies: single dependency on parent
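The three-method interface and the two implementations described so far can be sketched in a few lines of plain Scala. This is a simplified, hypothetical model (`MiniRDD`, `SourceRDD`, and the signatures below are invented for illustration and are much simpler than Spark's actual classes); an in-memory `Vector` of chunks stands in for HDFS blocks.

```scala
object MiniRddSketch {
  // Hypothetical, simplified version of the three-method interface.
  trait MiniRDD[A] {
    def getPartitions: Seq[Int]              // partition indices
    def getDependencies: Seq[MiniRDD[_]]     // parent datasets (lineage)
    def compute(partition: Int): Iterator[A] // produce one partition's elements
    // Convenience "action": compute every partition and concatenate the results.
    def collect(): List[A] = getPartitions.toList.flatMap(p => compute(p).toList)
  }

  // Source dataset: no parents, data read straight from in-memory chunks.
  final class SourceRDD[A](chunks: Vector[Vector[A]]) extends MiniRDD[A] {
    def getPartitions: Seq[Int] = chunks.indices
    def getDependencies: Seq[MiniRDD[_]] = Nil
    def compute(partition: Int): Iterator[A] = chunks(partition).iterator
  }

  // Mapped dataset: same partitions as the parent, one dependency on the parent.
  final class MappedRDD[A, B](parent: MiniRDD[A], f: A => B) extends MiniRDD[B] {
    def getPartitions: Seq[Int] = parent.getPartitions
    def getDependencies: Seq[MiniRDD[_]] = Seq(parent)
    def compute(partition: Int): Iterator[B] = parent.compute(partition).map(f)
  }

  val source = new SourceRDD(Vector(Vector(1, 2), Vector(3)))
  val mapped = new MappedRDD(source, (x: Int) => x * 10)
}
```

Because `MappedRDD.compute` simply asks its parent for the same partition and maps over the iterator, chained maps run as a single pass over each partition without materializing intermediate results.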
CoGroupedRDD
• compute: compute, shuffle then group parent RDDs
• getPartitions: one per reduce task
• getDependencies: shuffle each parent RDD
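A co-group differs from a map because records with the same key must be brought together from every parent. The sketch below imitates that with local collections: keys are routed to a reduce partition by hash, and each partition groups the values from both parents. The data, the `partitionOf` function, and the two-reducer setup are assumptions for illustration, not Spark's shuffle machinery.

```scala
object CoGroupSketch {
  val numReducers = 2

  // Two hypothetical parent datasets of key/value pairs.
  val left: Seq[(String, Int)] = Seq(("a", 1), ("b", 2), ("a", 3))
  val right: Seq[(String, String)] = Seq(("a", "x"), ("c", "y"))

  // "Shuffle": route each key to a reduce partition by hash.
  def partitionOf(key: String): Int = math.abs(key.hashCode % numReducers)

  // Compute one reduce partition: pull matching records from both parents,
  // then group the values of each parent by key.
  def compute(partition: Int): Map[String, (Seq[Int], Seq[String])] = {
    val l = left.filter { case (k, _) => partitionOf(k) == partition }
    val r = right.filter { case (k, _) => partitionOf(k) == partition }
    val keys = (l.map(_._1) ++ r.map(_._1)).distinct
    keys.map { k =>
      (k, (l.filter(_._1 == k).map(_._2), r.filter(_._1 == k).map(_._2)))
    }.toMap
  }

  // One output partition per reduce task; together they cover every key once.
  val result: Map[String, (Seq[Int], Seq[String])] =
    (0 until numReducers).flatMap(p => compute(p)).toMap
}
```

Every output partition depends on all partitions of both parents, which is why this kind of dependency requires a shuffle while the mapped case does not.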
Summary
• Simple Unified API through RDDs
• Interactive Analysis
• Hadoop Integration
• Performance
References
• http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
• https://www.youtube.com/watch?v=HG2Yd-3r4-M
• https://www.youtube.com/watch?v=e-Ys-2uVxM0
• RDD, MappedRDD, SchemaRDD, RDDFunctions, GraphOps, DStream