
Apache Spark & Hadoop


DESCRIPTION

Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now. That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and show how it complements Apache Hadoop. Keys Botzum - Senior Principal Technologist with MapR Technologies Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large-scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.


Page 1: Apache Spark & Hadoop


Apache Spark

Keys Botzum, Senior Principal Technologist, MapR Technologies

June 2014

Page 2: Apache Spark & Hadoop

Agenda
•  MapReduce
•  Apache Spark
•  How Spark Works
•  Fault Tolerance and Performance
•  Examples
•  Spark and More

Page 3: Apache Spark & Hadoop

MapR: Best Product, Best Business & Best Customers

•  Top ranked, exponential growth
•  500+ customers, including cloud leaders
•  3X bookings Q1 ‘13 – Q1 ‘14
•  80% of accounts expand 3X
•  90% software licenses
•  < 1% lifetime churn
•  > $1B in incremental revenue generated by 1 customer

Page 4: Apache Spark & Hadoop


Review: MapReduce

Page 5: Apache Spark & Hadoop

MapReduce: A Programming Model

•  MapReduce: Simplified Data Processing on Large Clusters (published 2004)
•  Parallel and Distributed Algorithm:
   –  Data Locality
   –  Fault Tolerance
   –  Linear Scalability

Page 6: Apache Spark & Hadoop

MapReduce Basics
•  Assumes a scalable distributed file system that shards data
•  Map
   –  Loading of the data and defining a set of keys
•  Reduce
   –  Collects the organized key-based data to process and output
•  Performance can be tweaked based on known details of your source files and cluster shape (size, total number)

Page 7: Apache Spark & Hadoop

MapReduce Processing Model
•  Define mappers
•  Shuffling is automatic
•  Define reducers
•  For complex work, chain jobs together
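To make the model concrete, here is an illustrative sketch of the map → shuffle → reduce flow using plain Scala collections (this is not the Hadoop API, and the input lines are made up for the example):

// Illustrative only: the MapReduce model on in-memory Scala collections.
val lines = Seq("the quick brown fox", "the lazy dog")

// Map: emit (key, value) pairs from each input record.
val mapped: Seq[(String, Int)] = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Shuffle: group all values by key (Hadoop does this step automatically between map and reduce).
val shuffled: Map[String, Seq[Int]] = mapped.groupBy(_._1).mapValues(_.map(_._2))

// Reduce: combine the values for each key into the final output.
val counts: Map[String, Int] = shuffled.mapValues(_.sum)

counts.foreach(println)   // e.g. (the,2), (quick,1), ...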

Page 8: Apache Spark & Hadoop

MapReduce: The Good

•  Built-in fault tolerance
•  Optimized IO path
•  Scalable
•  Developer focuses on Map/Reduce, not infrastructure
•  Simple? API

Page 9: Apache Spark & Hadoop

MapReduce: The Bad

•  Optimized for disk IO
   –  Doesn’t leverage memory well
   –  Iterative algorithms go through the disk IO path again and again
•  Primitive API
   –  Developers have to build on a very simple abstraction
   –  Key/Value in/out
   –  Even basic things like join require extensive code
•  Result is often many files that need to be combined appropriately

Page 10: Apache Spark & Hadoop


Apache Spark

Page 11: Apache Spark & Hadoop

Apache Spark

•  spark.apache.org
•  github.com/apache/spark
•  [email protected]

•  Originally developed in 2009 in UC Berkeley’s AMP Lab
•  Fully open sourced in 2010 – now at Apache Software Foundation
•  Commercial vendor developing/supporting

Page 12: Apache Spark & Hadoop

Spark: Easy and Fast Big Data

•  Easy to Develop
   –  Rich APIs in Java, Scala, Python
   –  Interactive shell

•  Fast to Run
   –  General execution graphs
   –  In-memory storage

2-5× less code

Page 13: Apache Spark & Hadoop

Resilient Distributed Datasets (RDD)
•  Spark revolves around RDDs
•  Fault-tolerant, read-only collection of elements that can be operated on in parallel
•  Cached in memory or on disk

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Page 14: Apache Spark & Hadoop

RDD Operations - Expressive

•  Transformations
   –  Creation of a new RDD dataset from an existing one
      •  map, filter, distinct, union, sample, groupByKey, join, reduce, etc…

•  Actions
   –  Return a value after running a computation
      •  collect, count, first, takeSample, foreach, etc…

Check the documentation for a complete list

http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
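As a quick illustration of the difference, a minimal spark-shell sketch (assuming a SparkContext `sc`; "data.txt" is a hypothetical input file): transformations only describe a new RDD, while actions actually run the job and return a value.

// Transformations: build new RDDs lazily, nothing executes yet.
val nums   = sc.textFile("data.txt").map(_.trim.toInt)
val evens  = nums.filter(_ % 2 == 0)
val sample = evens.distinct().sample(false, 0.1, 42)

// Actions: trigger computation and return results to the driver.
val total    = evens.count()     // a Long
val firstFew = sample.take(5)    // an Array[Int]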

Page 15: Apache Spark & Hadoop

Easy: Clean API

•  Resilient Distributed Datasets
   –  Collections of objects spread across a cluster, stored in RAM or on disk
   –  Built through parallel transformations
   –  Automatically rebuilt on failure

•  Operations
   –  Transformations (e.g. map, filter, groupBy)
   –  Actions (e.g. count, collect, save)

Write programs in terms of transformations on distributed datasets

Page 16: Apache Spark & Hadoop

Easy: Expressive API

•  map
•  reduce

Page 17: Apache Spark & Hadoop

Easy: Expressive API

•  map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
•  reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
•  sample, take, first, partitionBy, mapWith, pipe, save, ...

Page 18: Apache Spark & Hadoop

Easy: Example – Word Count

•  Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

•  Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
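For context, a complete standalone version of the Spark side might look like the sketch below (assuming the SparkConf-based setup available in Spark 1.0; the class name and the use of args for input/output paths are placeholders for the example):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// A minimal standalone word count; args(0) is the input path, args(1) the output path.
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    val counts = sc.textFile(args(0))              // one record per line
      .flatMap(line => line.split(" "))            // split lines into words
      .map(word => (word, 1))                      // emit (word, 1) pairs
      .reduceByKey(_ + _)                          // sum counts per word

    counts.saveAsTextFile(args(1))
    sc.stop()
  }
}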


Page 20: Apache Spark & Hadoop

Easy: Works Well With Hadoop

•  Data Compatibility
   –  Access your existing Hadoop data
   –  Use the same data formats
   –  Adheres to data locality for efficient processing

•  Deployment Models
   –  “Standalone” deployment
   –  YARN-based deployment
   –  Mesos-based deployment
   –  Deploy on an existing Hadoop cluster or side-by-side

Page 21: Apache Spark & Hadoop

Easy: User-Driven Roadmap

•  Language support
   –  Improved Python support
   –  SparkR
   –  Java 8
   –  Integrated schema and SQL support in Spark’s APIs

•  Better ML
   –  Sparse data support
   –  Model evaluation framework
   –  Performance testing

Page 22: Apache Spark & Hadoop

Example: Logistic Regression

data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient
print "Final w: %s" % w
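For reference, the update the code implements is one step of batch gradient descent on the logistic loss; a sketch of the math, with $x_i$ the feature vector and $y_i \in \{-1, +1\}$ the label of point $i$:

$$\nabla_w L = \sum_i \left(\frac{1}{1 + e^{-y_i\, w \cdot x_i}} - 1\right) y_i\, x_i, \qquad w \leftarrow w - \nabla_w L$$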

Page 23: Apache Spark & Hadoop

Fast: Logistic Regression Performance

[Chart: running time (s) vs. number of iterations (1–30), Hadoop vs. Spark]
•  Hadoop: 110 s per iteration
•  Spark: first iteration 80 s, further iterations 1 s

Page 24: Apache Spark & Hadoop

Easy: Multi-language Support

Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

Page 25: Apache Spark & Hadoop

Easy: Interactive Shell

Scala-based shell:

% /opt/mapr/spark/spark-0.9.1/bin/spark-shell
scala> val logs = sc.textFile("hdfs:///user/keys/logdata")
scala> logs.count()
…
res0: Long = 232681
scala> logs.filter(l => l.contains("ERROR")).count()
…
res1: Long = 205

Python-based shell as well – pyspark

Page 26: Apache Spark & Hadoop


Fault Tolerance and Performance

Page 27: Apache Spark & Hadoop

Fast: Using RAM, Operator Graphs

•  In-memory caching
   –  Data partitions read from RAM instead of disk
•  Operator graphs
   –  Scheduling optimizations
   –  Fault tolerance

[Diagram: an operator graph of RDDs (A–F) built from map, join, filter, and groupBy, split by the scheduler into Stages 1–3, with cached partitions held in RAM]

Page 28: Apache Spark & Hadoop

Directed Acyclic Graph (DAG)

•  Directed
   –  Only in a single direction
•  Acyclic
   –  No looping
•  This supports fault tolerance

Page 29: Apache Spark & Hadoop

Easy: Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

[Diagram: lineage – HDFS File → filter(func = startswith(…)) → Filtered RDD → map(func = split(…)) → Mapped RDD]
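A quick way to see this lineage from the shell is RDD.toDebugString (a minimal Scala sketch; the HDFS path is hypothetical and the exact output format varies by Spark version):

// Assumes a SparkContext `sc`; "hdfs:///logs" is a hypothetical path.
val msgs = sc.textFile("hdfs:///logs")
  .filter(_.startsWith("ERROR"))
  .map(_.split("\t")(2))

// Prints the chain of parent RDDs (the lineage) Spark would use to
// recompute any lost partition from the original HDFS blocks.
println(msgs.toDebugString)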

Page 30: Apache Spark & Hadoop

RDD Persistence / Caching

•  Variety of storage levels
   –  MEMORY_ONLY (default), MEMORY_AND_DISK, etc…
•  API calls
   –  persist(StorageLevel)
   –  cache() – shorthand for persist(StorageLevel.MEMORY_ONLY)
•  Considerations
   –  Read from disk vs. recompute (MEMORY_AND_DISK)
   –  Total memory storage size (MEMORY_ONLY_SER)
   –  Replicate to a second node for faster fault recovery (MEMORY_ONLY_2)
      •  Think about this option if supporting a time-sensitive client

http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
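A minimal sketch of these calls (assuming a SparkContext `sc` as in the shell examples; the HDFS path and tab-delimited format are hypothetical):

import org.apache.spark.storage.StorageLevel

// "hdfs:///events" is a hypothetical input path.
val events = sc.textFile("hdfs:///events")

val errors = events.filter(_.contains("ERROR"))
errors.cache()                                 // shorthand for persist(StorageLevel.MEMORY_ONLY)

val fields = errors.map(_.split("\t"))
fields.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions to disk instead of recomputing

errors.count()                                 // first action materializes (and caches) the data
errors.count()                                 // later actions are served from the cache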

Page 31: Apache Spark & Hadoop

PageRank Performance

[Chart: iteration time (s) vs. number of machines (30 and 60), Hadoop vs. Spark]
•  Hadoop: 171 s on 30 machines, 80 s on 60 machines
•  Spark: 23 s on 30 machines, 14 s on 60 machines

Page 32: Apache Spark & Hadoop

Other Iterative Algorithms

Time per iteration (s):
•  Logistic Regression: Spark 0.96 s vs. Hadoop 110 s
•  K-Means Clustering: Spark 4.1 s vs. Hadoop 155 s

Page 33: Apache Spark & Hadoop

Fast: Scaling Down

[Chart: execution time (s) vs. % of working set in cache]
•  Cache disabled: 69 s   •  25%: 58 s   •  50%: 41 s   •  75%: 30 s   •  Fully cached: 12 s

Page 34: Apache Spark & Hadoop

Comparison to Storm

•  Higher throughput than Storm
   –  Spark Streaming: 670k records/sec/node
   –  Storm: 115k records/sec/node
   –  Commercial systems: 100-500k records/sec/node

[Charts: throughput per node (MB/s) vs. record size (100 and 1000 bytes) for WordCount and Grep workloads, Spark vs. Storm]
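The deck does not show Spark Streaming code; for orientation, a minimal streaming word-count sketch against the Spark 1.0-era API (the host "localhost" and port 9999 are placeholders for the example):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// Count words arriving on a TCP socket in 1-second micro-batches.
val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()          // print a sample of each batch's counts
ssc.start()
ssc.awaitTermination()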

Page 35: Apache Spark & Hadoop


How Spark Works

Page 36: Apache Spark & Hadoop


Working With RDDs

Page 37: Apache Spark & Hadoop

Working With RDDs

RDD

textFile = sc.textFile("SomeFile.txt")

Page 38: Apache Spark & Hadoop

Working With RDDs

[Diagram: a chain of RDDs linked by transformations]

textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

Page 39: Apache Spark & Hadoop

Working With RDDs

[Diagram: a chain of RDDs linked by transformations, ending in an action that returns a value]

textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()
74

linesWithSpark.first()
# Apache Spark

Page 40: Apache Spark & Hadoop


Example: Log Mining

Page 41: Apache Spark & Hadoop

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 42: Apache Spark & Hadoop

Example: Log Mining (continued)

[Diagram: a Driver program coordinating three Workers]

Page 43: Apache Spark & Hadoop

Example: Log Mining (continued)

lines = spark.textFile("hdfs://...")

[Diagram: Driver and three Workers]

Page 44: Apache Spark & Hadoop

Example: Log Mining (continued)

lines = spark.textFile("hdfs://...")   # Base RDD

[Diagram: Driver and three Workers]

Page 45: Apache Spark & Hadoop

Example: Log Mining (continued)

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))

[Diagram: Driver and three Workers]

Page 46: Apache Spark & Hadoop

Example: Log Mining (continued)

errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD

[Diagram: Driver and three Workers]

Page 47: Apache Spark & Hadoop

Example: Log Mining (continued)

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: Driver and three Workers]

Page 48: Apache Spark & Hadoop

Example: Log Mining (continued)

messages.filter(lambda s: "mysql" in s).count()   # Action

[Diagram: the count() action triggers a job on the cluster]

Page 49: Apache Spark & Hadoop

Example: Log Mining (continued)

[Diagram: Driver and three Workers, each Worker holding one block of the log file (Block 1, 2, 3)]

Page 50: Apache Spark & Hadoop

Example: Log Mining (continued)

[Diagram: the Driver ships tasks to the three Workers]

Page 51: Apache Spark & Hadoop

Example: Log Mining (continued)

[Diagram: each Worker reads its HDFS block]

Page 52: Apache Spark & Hadoop

Example: Log Mining (continued)

[Diagram: each Worker processes its block and caches the data (Cache 1, 2, 3)]

Page 53: Apache Spark & Hadoop

Example: Log Mining (continued)

[Diagram: each Worker returns its results to the Driver]

Page 54: Apache Spark & Hadoop

Example: Log Mining (continued)

messages.filter(lambda s: "php" in s).count()

[Diagram: a second query runs against the same cached messages RDD]

Page 55: Apache Spark & Hadoop

Example: Log Mining (continued)

[Diagram: the Driver ships tasks for the second query to the Workers]

Page 56: Apache Spark & Hadoop

Example: Log Mining (continued)

[Diagram: each Worker answers the second query from its cache]

Page 57: Apache Spark & Hadoop

Example: Log Mining (continued)

[Diagram: the Workers return the second query’s results to the Driver]

Page 58: Apache Spark & Hadoop

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

Cache your data → Faster results
Full-text search of Wikipedia
•  60GB on 20 EC2 machines
•  0.5 sec from cache vs. 20 s for on-disk
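The build-up above is in Python; for reference, an equivalent Scala sketch of the same workflow (the HDFS path and tab-delimited log format simply mirror the example):

// Assumes a SparkContext `sc`, as in the earlier shell examples.
val lines = sc.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")(2))
messages.cache()

// The first action reads from HDFS, filters, and populates the cache.
val mysqlErrors = messages.filter(_.contains("mysql")).count()

// Subsequent queries are served from the cached partitions.
val phpErrors = messages.filter(_.contains("php")).count()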

Page 59: Apache Spark & Hadoop


Example: Page Rank

Page 60: Apache Spark & Hadoop

Example: PageRank

•  Good example of a more complex algorithm
   –  Multiple stages of map & reduce
•  Benefits from Spark’s in-memory caching
   –  Multiple iterations over the same data

Page 61: Apache Spark & Hadoop

Basic Idea

Give pages ranks (scores) based on links to them
•  Links from many pages → high rank
•  Link from a high-rank page → high rank

Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

Page 62: Apache Spark & Hadoop

Algorithm

1.  Start each page at a rank of 1
2.  On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3.  Set each page’s rank to 0.15 + 0.85 × contribs

[Diagram: four pages, each starting with rank 1.0]
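As a worked step (using the numbers on the following slides): a page with rank 1.0 and two out-links contributes 0.5 to each neighbor; a page that then receives contributions of 1.0, 0.5, and 0.5 is set to 0.15 + 0.85 × (1.0 + 0.5 + 0.5) = 1.85, while a page receiving only 0.5 gets 0.15 + 0.85 × 0.5 ≈ 0.58.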

Page 63: Apache Spark & Hadoop

Algorithm (continued)

[Diagram: first iteration – each page (rank 1.0) sends contributions of 1 or 0.5 along its out-links]

Page 64: Apache Spark & Hadoop

Algorithm (continued)

[Diagram: ranks after the first iteration: 1.85, 1.0, 0.58, 0.58]

Page 65: Apache Spark & Hadoop

Algorithm (continued)

[Diagram: second iteration – the current ranks (1.85, 1.0, 0.58, 0.58) send contributions (e.g. 0.29, 0.5) along their out-links]

Page 66: Apache Spark & Hadoop

Algorithm (continued)

[Diagram: ranks after the second iteration: 1.72, 1.31, 0.58, 0.39, and so on . . .]

Page 67: Apache Spark & Hadoop

Algorithm (continued)

Final state:
[Diagram: final ranks: 1.44, 1.37, 0.73, 0.46]

Page 68: Apache Spark & Hadoop

Scala Implementation

val links = // load RDD of (url, neighbors) pairs
var ranks = // give each url a rank of 1.0

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) =>
      urls.map(dest => (dest, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala
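For completeness, the two placeholder comments might be filled in roughly as in the linked SparkPageRank example (a sketch only; it assumes input lines of the form "sourceURL destURL" and a SparkContext `sc`):

// Hypothetical setup for the placeholders above.
val lines = sc.textFile("hdfs://.../links.txt")

val links = lines.map { line =>
  val parts = line.split("\\s+")
  (parts(0), parts(1))                 // (source, destination)
}.distinct().groupByKey().cache()      // (url, neighbors) pairs, cached across iterations

var ranks = links.mapValues(_ => 1.0)  // start every page at rank 1.0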

Page 69: Apache Spark & Hadoop


Spark and More

Page 70: Apache Spark & Hadoop

Easy: Unified Platform

Spark (general execution engine), with libraries on top:
•  Spark SQL (SQL)
•  Spark Streaming (streaming)
•  MLlib (machine learning)
•  GraphX (graph computation)

Continued innovation bringing new functionality, e.g.:
•  BlinkDB (approximate queries)
•  SparkR (R wrapper for Spark)
•  Tachyon (off-heap RDD caching)
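As a taste of one of these libraries, a minimal MLlib clustering sketch (assuming the Spark 1.0-era MLlib API and a SparkContext `sc`; "kmeans_data.txt" is a placeholder file with one space-separated feature vector per line):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line into a dense feature vector and cache for the iterative algorithm.
val points = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(points, 2, 10)   // k = 2 clusters, 10 iterations
println(model.clusterCenters.mkString("\n"))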

Page 71: Apache Spark & Hadoop

Spark on MapR

•  Certified Spark Distribution
•  Fully supported and packaged by MapR in partnership with Databricks
   –  mapr-spark package with Spark, Shark, Spark Streaming today
   –  Spark-python, GraphX and MLlib soon
•  YARN integration
   –  Spark can then allocate resources from the cluster when needed

Page 72: Apache Spark & Hadoop

References

•  Based on slides from Pat McDonough at Databricks
•  Spark web site: http://spark.apache.org/
•  Spark on MapR:
   –  http://www.mapr.com/products/apache-spark
   –  http://doc.mapr.com/display/MapR/Installing+Spark+and+Shark

Page 73: Apache Spark & Hadoop

Q & A

Engage with us!
•  @mapr / maprtech
•  [email protected]
•  MapR / maprtech / mapr-technologies