Apache Spark RDDs
Dean Chen, eBay Inc.


DESCRIPTION

Video of the talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM Apache Spark is a next-generation engine for large-scale data processing built with Scala. This talk will first show how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how the abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib and GraphX). The Spark source will be referenced to illustrate how these concepts are implemented with Scala. http://www.meetup.com/Scala-Bay/events/209740892/


Page 1: Apache Spark RDDs

Apache Spark RDDs
Dean Chen, eBay Inc.

Page 2: Apache Spark RDDs

http://spark-summit.org/wp-content/uploads/2014/07/Sparks-Role-in-the-Big-Data-Ecosystem-Matei-Zaharia1.pdf

Page 3: Apache Spark RDDs

Spark

• 2010 paper from Berkeley's AMPLab

• Resilient distributed datasets (RDDs)

• Generalized distributed computation engine/platform

• Fault-tolerant in-memory caching

• Extensible interface for various workloads

Page 4: Apache Spark RDDs

http://spark-summit.org/wp-content/uploads/2014/07/Sparks-Role-in-the-Big-Data-Ecosystem-Matei-Zaharia1.pdf

Page 5: Apache Spark RDDs

https://amplab.cs.berkeley.edu/software/

Page 6: Apache Spark RDDs

RDDs

• Resilient distributed datasets

• "read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost"

• Familiar Scala collections API for distributed data and computation

• Monadic expression of lazy transformations, but not monads
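To make the laziness concrete, a minimal sketch in shell style (data and names here are invented for illustration):

val nums = sc.parallelize(1 to 1000000)   // RDD[Int] from a local collection

// Transformations are lazy: these lines only record lineage, nothing runs yet
val evens   = nums.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// An action triggers the actual distributed computation
val total = squares.reduce(_ + _)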

Page 7: Apache Spark RDDs

Spark Shell

• Interactive queries and prototyping

• Local, YARN, Mesos

• Static type checking and autocomplete

• Lambdas
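In the shell, sc (a SparkContext) is already defined; a standalone app builds its own. A minimal sketch of that setup (app name and master URL are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("rdd-demo")   // placeholder app name
  .setMaster("local[*]")    // or a YARN/Mesos master URL
val sc = new SparkContext(conf)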

Page 8: Apache Spark RDDs
Page 9: Apache Spark RDDs

val titles = sc.textFile("titles.txt")

val countsRdd = titles
  .flatMap(tokenize)
  .map(word => (cleanse(word), 1))
  .reduceByKey(_ + _)

val counts = countsRdd
  .filter { case (_, total) => total > 10000 }
  .sortBy { case (_, total) => total }
  .filter { case (word, _) => word.length >= 5 }
  .collect
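The deck does not show tokenize and cleanse; hypothetical definitions, assumed here so the example is self-contained:

// Hypothetical helpers; the slides do not define them
def tokenize(line: String): Seq[String] =
  line.split("\\s+").toSeq

def cleanse(word: String): String =
  word.toLowerCase.filter(_.isLetter)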

Page 10: Apache Spark RDDs

Transformations

map, filter, flatMap, sample, union, intersection, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup, cartesian
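Each of these lazily returns a new RDD; a few sketched in shell style (data invented for illustration):

val words  = sc.parallelize(Seq("spark", "scala", "spark"))
val pairs  = words.map(w => (w, 1))       // RDD[(String, Int)]
val counts = pairs.reduceByKey(_ + _)     // key-value op via the PairRDDFunctions implicit; shuffles by key
val lengths = words.distinct().map(w => (w, w.length))
val joined  = counts.join(lengths)        // RDD[(String, (Int, Int))]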

Page 11: Apache Spark RDDs

Actions

reduce, collect, count, first, take, takeSample, saveAsTextFile, foreach
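Actions, by contrast, force evaluation and return a result to the driver or write output. Sketched examples (data and output path invented for illustration):

val counts = sc.parallelize(Seq(("spark", 2), ("scala", 1)))

counts.count()                        // 2: number of elements
counts.first()                        // ("spark", 2)
counts.take(1)                        // Array(("spark", 2))
counts.reduce((a, b) => if (a._2 > b._2) a else b)   // element with the highest count
counts.saveAsTextFile("counts-out")   // writes one text file per partition; placeholder path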

Page 12: Apache Spark RDDs
Page 13: Apache Spark RDDs

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class Count(word: String, total: Int)

val schemaRdd = countsRdd.map(c => Count(c._1, c._2))

val count = schemaRdd
  .where('word === "scala")
  .select('total)
  .collect

Page 14: Apache Spark RDDs

schemaRdd.registerTempTable("counts")

sql(" SELECT total FROM counts WHERE word = 'scala' ").collect

schemaRdd .filter(_.word == "scala") .map(_.total) .collect

Page 15: Apache Spark RDDs

registerFunction("LEN", (_: String).length)

val queryRdd = sql("""
  SELECT * FROM counts
  WHERE LEN(word) = 10
  ORDER BY total DESC
  LIMIT 10
""")

queryRdd
  .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
  .collect()
  .foreach(println)

Page 16: Apache Spark RDDs
Page 17: Apache Spark RDDs

Spark Streaming

• Real-time computation, similar to Storm

• Input replicated in memory for fault tolerance

• Streaming input split into sliding windows of RDDs

• Kafka, Flume, Kinesis, HDFS

Page 18: Apache Spark RDDs
Page 19: Apache Spark RDDs

TwitterUtils.createStream(ssc, None)       // createStream needs a StreamingContext; None uses the default Twitter auth
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5), Seconds(5))   // countByWindow takes both window and slide durations
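The snippet assumes a StreamingContext named ssc; a minimal sketch of the surrounding setup (batch interval and checkpoint path are arbitrary choices, not from the slides):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))   // 5s batch interval (arbitrary)
ssc.checkpoint("checkpoint-dir")                 // windowed counting requires a checkpoint directory; placeholder path

// build the DStream pipeline shown above, then:
ssc.start()              // begin receiving and processing
ssc.awaitTermination()   // block until the job is stopped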

Page 20: Apache Spark RDDs
Page 21: Apache Spark RDDs

GraphX

• Optimally partitions and indexes vertices and edges represented as RDDs

• APIs to join and traverse graphs

• PageRank, connected components, triangle counting

Page 22: Apache Spark RDDs

import org.apache.spark.graphx._

// userIdRDD and assocRDD are the vertex and edge RDDs (their construction is not shown in the deck)
val graph = Graph(userIdRDD, assocRDD)

val ranks = graph.pageRank(0.0001).vertices

val userRdd = sc.textFile("graphx/data/users.txt")
val users = userRdd.map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}

val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}

Page 23: Apache Spark RDDs
Page 24: Apache Spark RDDs

MLlib

• Machine learning library, similar to Mahout

• Statistics, regression, decision trees, clustering, PCA, gradient descent

• Iterative algorithms much faster due to in-memory caching

Page 25: Apache Spark RDDs

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble))
  )
}

val model = LinearRegressionWithSGD.train(parsedData, 100)

val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

val MSE = valuesAndPreds
  .map { case (v, p) => math.pow(v - p, 2) }
  .mean()

Page 26: Apache Spark RDDs

RDDs

• Resilient distributed datasets

• Familiar Scala collections API

• Distributed data and computation

• Monadic expression of transformations

• But not monads

Page 27: Apache Spark RDDs

Pseudo Monad

• Wraps an iterator plus the distribution of data across partitions

• Keeps track of lineage history for fault tolerance

• Lazily evaluated, allowing chaining of expressions
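Because RDD defines map and flatMap, for-comprehensions work over RDDs, which is what makes the API read monadically; but flatMap takes a function returning a TraversableOnce, not another RDD, so RDDs cannot nest and the monad laws do not strictly apply. A sketch (file name reused from the earlier example):

// Desugars to sc.textFile("titles.txt").flatMap(line => line.split("\\s+").map(w => (w, 1)))
val pairs = for {
  line <- sc.textFile("titles.txt")
  word <- line.split("\\s+")
} yield (word, 1)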

Page 28: Apache Spark RDDs

https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf

Page 29: Apache Spark RDDs

RDD Interface

• compute: transformation applied to iterable(s)

• getPartitions: partition data for parallel computation

• getDependencies: lineage of parent RDDs and if shuffle is required
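Condensed from the Spark source (the real RDD class has many more members), the interface is roughly:

import org.apache.spark.{Dependency, Partition, TaskContext}
import scala.reflect.ClassTag

// Condensed sketch of org.apache.spark.rdd.RDD
abstract class RDD[T: ClassTag] {
  // Compute the elements of one partition
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // How the dataset is divided for parallel computation
  protected def getPartitions: Array[Partition]

  // Lineage: the parent RDDs, each a narrow or shuffle dependency
  protected def getDependencies: Seq[Dependency[_]]
}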

Page 30: Apache Spark RDDs

HadoopRDD

• compute: read HDFS block or file split

• getPartitions: HDFS block or file split

• getDependencies: None

Page 31: Apache Spark RDDs

MappedRDD

• compute: compute parent and map result

• getPartitions: parent partition

• getDependencies: single dependency on parent
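Paraphrased from the Spark 1.x source, MappedRDD is only a few lines, showing how cheap a narrow transformation is to define:

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {  // the RDD(prev) constructor records a one-to-one dependency on prev

  // Same partitioning as the parent
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // Compute the parent's iterator for this split, then map f over it
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    firstParent[T].iterator(split, context).map(f)
}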

Page 32: Apache Spark RDDs

CoGroupedRDD

• compute: compute, shuffle then group parent RDDs

• getPartitions: one per reduce task

• getDependencies: shuffle each parent RDD
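Its getDependencies, paraphrased and simplified from the Spark source, decides per parent: a parent already partitioned the required way gets a narrow one-to-one dependency, anything else gets a shuffle dependency:

// Simplified; the real code also threads type parameters and a serializer
override def getDependencies: Seq[Dependency[_]] =
  rdds.map { rdd =>
    if (rdd.partitioner == Some(part)) {
      new OneToOneDependency(rdd)        // already co-partitioned: no shuffle needed
    } else {
      new ShuffleDependency(rdd, part)   // repartition this parent by key
    }
  }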

Page 33: Apache Spark RDDs

Summary

• Simple Unified API through RDDs

• Interactive Analysis

• Hadoop Integration

• Performance

Page 34: Apache Spark RDDs

References

• http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf

• https://www.youtube.com/watch?v=HG2Yd-3r4-M

• https://www.youtube.com/watch?v=e-Ys-2uVxM0

• RDD, MappedRDD, SchemaRDD, RDDFunctions, GraphOps, DStream (classes in the Spark source)