An introduction to Apache Spark presented at the Boston Apache Spark User Group in July 2014
Boston Apache Spark User Group
(the Spahk group)
Microsoft NERD Center - Horace Mann
Tuesday, 15 July 2014
Intro to Apache Spark
Matthew Farrellee, @spinningmatt
Updated: July 2014
Background - MapReduce / Hadoop
● Map & reduce around for 5+ decades (McCarthy, 1960)
● Dean and Ghemawat demonstrate map and reduce for distributed data processing (Google, 2004)
● MapReduce paper timed well with commodity hardware capabilities of the early 2000s
● Open source implementation in 2006
● Years of innovation improving, simplifying, expanding
MapReduce / Hadoop difficulties
● Hardware evolved
○ Networks became fast
○ Memory became cheap
● Programming model proved non-trivial
○ Gave birth to multiple attempts to simplify, e.g. Pig, Hive, ...
● Primarily batch execution mode
○ Begat specialized (non-batch) modes, e.g. Storm, Drill, Giraph, ...
Some history - Spark
● Started in UC Berkeley AMPLab by Matei Zaharia, 2009
○ AMP = Algorithms Machines People
○ AMPLab is integrating Algorithms, Machines, and People to make sense of Big Data
● Open sourced, 2010
● Donated to Apache Software Foundation, 2013
● Graduated to top level project, 2014
● 1.0 release, May 2014
What is Apache Spark?
An open source, efficient and productive cluster computing system that is interoperable with Hadoop
Open source
● Top level Apache project
● http://www.ohloh.net/p/apache-spark
○ In a Nutshell, Apache Spark...
○ has had 7,366 commits made by 299 contributors representing 117,823 lines of code
○ is mostly written in Scala with a well-commented source code
○ has a codebase with a long source history maintained by a very large development team with increasing Y-O-Y commits
○ took an estimated 30 years of effort (COCOMO model) starting with its first commit in March, 2010 ending with its most recent commit 5 days ago
Efficient
● In-memory primitives
○ Use cluster memory and spill to disk only when necessary
● High performance
○ https://amplab.cs.berkeley.edu/benchmark/
● General compute graphs, DAGs
○ Not just: Load -> Map -> Reduce -> Store -> Load -> Map -> Reduce -> Store
○ Rich and pipelined: Load -> Map -> Union -> Reduce -> Filter -> Group -> Sample -> Store
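A single Spark job can express that richer pipeline directly. A minimal sketch (paths, parsing, and thresholds are placeholders; assumes an existing SparkContext named sc):

// One job: Load -> Map -> Union -> Reduce -> Filter -> Sample -> Store
val june = sc.textFile("hdfs://namenode/logs/2014-06/*").map(line => (line.split(" ")(0), 1))
val july = sc.textFile("hdfs://namenode/logs/2014-07/*").map(line => (line.split(" ")(0), 1))

june.union(july)
  .reduceByKey(_ + _)                 // count per key across both inputs
  .filter { case (_, n) => n > 10 }   // keep only frequent keys
  .sample(false, 0.01, 42)            // keep ~1% of the survivors
  .saveAsTextFile("hdfs://namenode/out/sampled-counts")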
Interoperable
● Read and write data from HDFS (or any storage system with an HDFS-like API)
● Read and write Hadoop file formats
● Run on YARN
● Interact with Hive, HBase, etc.
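For example, reading a Hadoop SequenceFile from HDFS is a single call on the SparkContext. A sketch (placeholder paths; assumes Text keys, IntWritable values, and an existing SparkContext sc):

import org.apache.hadoop.io.{IntWritable, Text}

// Read a SequenceFile written by a Hadoop job (placeholder path)
val records = sc.sequenceFile("hdfs://namenode/data/input.seq", classOf[Text], classOf[IntWritable])

// Convert Writables to plain Scala types before shuffling or collecting
records.map { case (k, v) => (k.toString, v.get) }
  .saveAsTextFile("hdfs://namenode/data/as-text")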
Productive
● Unified data model, the RDD
● Multiple execution modes
○ Batch, interactive, streaming
● Multiple languages
○ Scala, Java, Python, R, SQL
● Rich standard library
○ Machine learning, streaming, graph processing, ETL
● Consistent API across languages
● Significant code reduction compared to MapReduce
Consistent API, less code...

MapReduce (Java):
public static class WordCountMapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
public static class WordCountReduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Spark (Scala):

import org.apache.spark._
val sc = new SparkContext(new SparkConf().setAppName("word count"))
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark (Python):

from operator import add
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("word count"))
file = sc.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(add)
counts.saveAsTextFile("hdfs://...")
Spark workflow
[Workflow diagram: Load -> RDD -> Transform -> RDD -> Action -> Value; Save writes an RDD back to storage]
The RDD
The resilient distributed dataset
A lazily evaluated, fault-tolerant collection of elements that can be operated on in parallel
RDDs technically
1. a set of partitions ("splits" in Hadoop terms)
2. a list of dependencies on parent RDDs
3. a function to compute a partition given its parents
4. an optional partitioner (hash, range)
5. optional preferred location(s) for each partition
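Roughly, in code, that interface looks like the following conceptual sketch (simplified and hypothetical, not Spark's actual RDD class, whose methods also take a TaskContext and differ in visibility):

import org.apache.spark.{Dependency, Partition, Partitioner}

// Conceptual sketch only - not org.apache.spark.rdd.RDD
abstract class ConceptualRDD[T] {
  def partitions: Array[Partition]                              // 1. set of partitions ("splits")
  def dependencies: Seq[Dependency[_]]                          // 2. dependencies on parent RDDs
  def compute(split: Partition): Iterator[T]                    // 3. compute a partition from its parents
  val partitioner: Option[Partitioner] = None                   // 4. optional partitioner (hash, range)
  def preferredLocations(split: Partition): Seq[String] = Nil   // 5. optional preferred locations
}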
Load
Create an RDD.

● parallelize - convert a collection
● textFile - load a text file
● wholeTextFiles - load a dir of text files
● sequenceFile / hadoopFile - load using Hadoop file formats
● More, http://spark.apache.org/docs/latest/programming-guide.html#external-datasets
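A sketch of the first three (assumes an existing SparkContext sc; HDFS paths are placeholders):

// From a local collection - handy for tests and small lookup data
val nums = sc.parallelize(1 to 1000)

// From a text file - one element per line; also accepts directories and globs
val lines = sc.textFile("hdfs://namenode/data/events.log")

// From a directory of small text files - yields (filename, contents) pairs
val docs = sc.wholeTextFiles("hdfs://namenode/data/docs/")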
Transform

Lazy operations. Build compute DAG. Don't trigger computation.

● map(func) - elements passed through func
● flatMap(func) - func can return >=0 elements
● filter(func) - subset of elements
● sample(..., fraction, ...) - select fraction
● union(other) - union of two RDDs
● distinct - new RDD w/ distinct elements
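A sketch of what lazy means in practice (tiny hypothetical data; assumes an existing SparkContext sc):

val lines = sc.parallelize(Seq("to be or not to be", "that is the question"))

// Each call below only adds a node to the DAG - no work runs yet
val words   = lines.flatMap(_.split(" "))
val keepers = words.filter(w => w != "to" && w != "the")
val uniques = keepers.distinct()
// Computation is triggered later, by an action such as count or saveAsTextFile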
Transform (cont)

● groupByKey - (K, V) -> (K, Seq[V])
● reduceByKey(func) - (K, V) -> (K, func(V...))
● sortByKey - order (K, V) by K
● join(other) - (K, V) + (K, W) -> (K, (V, W))
● cogroup/groupWith(other) - (K, V) + (K, W) -> (K, (Seq[V], Seq[W]))
● cartesian(other) - cartesian product (all pairs)
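A short sketch on tiny in-memory pair RDDs (illustrative data; the comments show what an action such as collect would return):

val scores = sc.parallelize(Seq(("alice", 1), ("bob", 2), ("alice", 3)))
val cities = sc.parallelize(Seq(("alice", "boston"), ("bob", "nyc")))

scores.reduceByKey(_ + _)   // ("alice", 4), ("bob", 2)
scores.groupByKey()         // ("alice", [1, 3]), ("bob", [2])
scores.sortByKey()          // elements ordered by key
scores.join(cities)         // ("alice", (1, "boston")), ("alice", (3, "boston")), ("bob", (2, "nyc"))
scores.cogroup(cities)      // ("alice", ([1, 3], ["boston"])), ("bob", ([2], ["nyc"]))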
More available in documentation...
http://spark.apache.org/docs/latest/programming-guide.html#transformations
Action
Active operations. Trigger execution of DAG. Result in a value.

● reduce(func) - reduce elements w/ func
● collect - convert to native collection
● count - count elements
● foreach(func) - apply func to elements
● take(n) - return some elements
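A sketch of these on a small RDD (assumes an existing SparkContext sc):

val nums = sc.parallelize(1 to 100)

nums.reduce(_ + _)              // 5050
nums.count()                    // 100
nums.take(5)                    // Array(1, 2, 3, 4, 5)
nums.collect()                  // all 100 elements as a local Array - use with care on large RDDs
nums.foreach(x => println(x))   // runs on the executors, not the driver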
More available in documentation...
http://spark.apache.org/docs/latest/programming-guide.html#actions
Save
Action that results in data stored to the file system.

● saveAsTextFile
● saveAsSequenceFile
● saveAsObjectFile / saveAsPickleFile
● More, http://spark.apache.org/docs/latest/programming-guide.html#actions
Spark workflow
file = sc.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(add)
counts.saveAsTextFile("hdfs://...")
Multiple modes, rich standard library
[Stack diagram: Apache Spark - SQL, Streaming, MLlib, GraphX, ... on top of Spark Core]
Spark SQL

● Components
○ Catalyst - generic optimization for relational algebra
○ Core - RDD execution; formats: Parquet, JSON
○ Hive support - run HiveQL and use Hive warehouse
● SchemaRDDs and SQL
[Diagram: an RDD of User objects becomes a SchemaRDD with Name, Age, Height columns]

sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
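A sketch along the lines of the Spark 1.0 SQL programming guide (the Person class, file path, and table name are illustrative; implicits and method names vary across versions):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlCtx = new SQLContext(sc)
import sqlCtx.createSchemaRDD   // implicit RDD -> SchemaRDD conversion in Spark 1.0.x

// Build an RDD of Person and register it as a table (placeholder input file)
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")   // Spark 1.0 name; later releases use registerTempTable

val teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)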
Spark Streaming
● Run a streaming computation as a series of time bound, deterministic batch jobs
● Time bound used to break stream into RDDs
[Diagram: input stream broken into RDDs, each X seconds wide, processed by Spark into results]
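A sketch of a streaming word count over 5-second batches (host and port are placeholders; assumes an existing SparkContext sc and something writing text to the socket):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair DStream operations in older releases

// The batch interval ("X seconds wide") bounds each RDD in the stream
val ssc = new StreamingContext(sc, Seconds(5))

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()          // emit each 5-second batch's counts
ssc.start()
ssc.awaitTermination()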
MLlib
● Machine learning algorithms over RDDs
● Classification - logistic regression, linear support vector machines, naive Bayes, decision trees
● Regression - linear regression, regression trees
● Collaborative filtering - alternating least squares
● Clustering - K-Means
● Optimization - stochastic gradient descent, limited-memory BFGS
● Dimensionality reduction - singular value decomposition, principal component analysis
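For instance, clustering with K-Means takes a few lines. A sketch (the input file of space-separated coordinates is a placeholder; assumes an existing SparkContext sc):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse lines of space-separated numbers into dense vectors
val points = sc.textFile("hdfs://namenode/data/points.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()

val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)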
Deploying Spark
Source: http://spark.apache.org/docs/latest/cluster-overview.html
Deploying Spark
● Driver program - shell or standalone program that creates a SparkContext and works with RDDs
● Cluster Manager - standalone, Mesos, or YARN
○ Standalone - the default; simple setup, master + worker processes on nodes
○ Mesos - a general purpose manager that runs Hadoop and other services; two modes of operation, fine & coarse
○ YARN - Hadoop 2's resource manager
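The cluster manager is selected by the master URL, set on the SparkConf or via spark-submit's --master flag. A sketch (host names, ports, and app name are placeholders; exact YARN and Mesos strings depend on version and setup):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("spark://master-host:7077")    // standalone cluster manager
  // .setMaster("mesos://master-host:5050") // Mesos
  // .setMaster("local[4]")                 // local mode, 4 threads - handy for development
  // On YARN the master is usually supplied by spark-submit (--master yarn-client / yarn-cluster)

val sc = new SparkContext(conf)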
Highlights from Spark Summit 2014
Spark Summit East: New York in early 2015 - http://spark-summit.org/east/2015
Thanks to...
@pacoid, @rxin and @michaelarmbrust for letting me crib slides for this introduction