
Transcript
Page 1: Apache Spark & Hadoop


Apache Spark

Keys Botzum, Senior Principal Technologist, MapR Technologies

June 2014

Page 2: Apache Spark & Hadoop


Agenda
•  MapReduce
•  Apache Spark
•  How Spark Works
•  Fault Tolerance and Performance
•  Examples
•  Spark and More

Page 3: Apache Spark & Hadoop


MapR: Best Product, Best Business & Best Customers
•  Top ranked
•  Exponential growth
•  500+ customers
•  Cloud leaders
•  3X bookings Q1 ‘13 – Q1 ‘14
•  80% of accounts expand 3X
•  90% software licenses
•  < 1% lifetime churn
•  > $1B in incremental revenue generated by 1 customer

Page 4: Apache Spark & Hadoop


Review: MapReduce

Page 5: Apache Spark & Hadoop


MapReduce: A Programming Model

•  MapReduce: Simplified Data Processing on Large Clusters (published 2004)

•  Parallel and Distributed Algorithm:

–  Data Locality
–  Fault Tolerance
–  Linear Scalability

Page 6: Apache Spark & Hadoop


MapReduce Basics
•  Assumes a scalable distributed file system that shards data
•  Map
–  Loading of the data and defining a set of keys
•  Reduce
–  Collects the organized key-based data to process and output
•  Performance can be tweaked based on known details of your source files and cluster shape (size, total number)

Page 7: Apache Spark & Hadoop


MapReduce Processing Model
•  Define mappers
•  Shuffling is automatic
•  Define reducers
•  For complex work, chain jobs together

Page 8: Apache Spark & Hadoop


MapReduce: The Good
•  Built-in fault tolerance
•  Optimized IO path
•  Scalable
•  Developer focuses on Map/Reduce, not infrastructure
•  Simple(?) API

Page 9: Apache Spark & Hadoop


MapReduce: The Bad
•  Optimized for disk IO
–  Doesn’t leverage memory well
–  Iterative algorithms go through the disk IO path again and again
•  Primitive API
–  Developers have to build on a very simple abstraction
–  Key/Value in/out
–  Even basic things like join require extensive code
•  Result is often many files that need to be combined appropriately

Page 10: Apache Spark & Hadoop


Apache Spark

Page 11: Apache Spark & Hadoop


Apache Spark
•  spark.apache.org
•  github.com/apache/spark
•  [email protected]
•  Originally developed in 2009 in UC Berkeley’s AMP Lab
•  Fully open sourced in 2010 – now at Apache Software Foundation
–  Commercial vendor developing/supporting

Page 12: Apache Spark & Hadoop


Spark: Easy and Fast Big Data
•  Easy to Develop
–  Rich APIs in Java, Scala, Python
–  Interactive shell
•  Fast to Run
–  General execution graphs
–  In-memory storage
•  2-5× less code

Page 13: Apache Spark & Hadoop


Resilient Distributed Datasets (RDD)
•  Spark revolves around RDDs
•  Fault-tolerant, read-only collection of elements that can be operated on in parallel
•  Cached in memory or on disk
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
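As a minimal Scala sketch of the idea (the `sc` SparkContext and the HDFS path are assumed, as in the shell examples later in the deck):

val lines = sc.textFile("hdfs://...")            // an RDD[String], one element per line
val errors = lines.filter(_.contains("ERROR"))   // transformations return new RDDs; `lines` is never modified
errors.cache()                                   // ask Spark to keep the filtered partitions in memory
errors.count()                                   // the first action computes and caches the RDD
errors.count()                                   // later actions reuse the cached partitions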

Page 14: Apache Spark & Hadoop


RDD Operations - Expressive
•  Transformations
–  Creation of a new RDD dataset from an existing one
•  map, filter, distinct, union, sample, groupByKey, join, reduceByKey, etc.
•  Actions
–  Return a value after running a computation
•  collect, count, first, takeSample, foreach, etc.
Check the documentation for a complete list:
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
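To make the transformation/action split concrete, a small hedged Scala sketch (path and names are illustrative):

val words  = sc.textFile("hdfs://...").flatMap(_.split(" "))   // transformation: nothing executes yet
val pairs  = words.map(word => (word, 1))                      // transformation
val counts = pairs.reduceByKey(_ + _)                          // transformation
counts.count()                                                 // action: the job actually runs
counts.take(5).foreach(println)                                // another action over the same lineage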

Page 15: Apache Spark & Hadoop


Easy: Clean API
•  Resilient Distributed Datasets
–  Collections of objects spread across a cluster, stored in RAM or on disk
–  Built through parallel transformations
–  Automatically rebuilt on failure
•  Operations
–  Transformations (e.g. map, filter, groupBy)
–  Actions (e.g. count, collect, save)
Write programs in terms of transformations on distributed datasets

Page 16: Apache Spark & Hadoop


Easy: Expressive API

•  map
•  reduce

Page 17: Apache Spark & Hadoop


Easy: Expressive API

•  map

•  filter

•  groupBy

•  sort

•  union

•  join

•  leftOuterJoin

•  rightOuterJoin

•  reduce

•  count

•  fold

•  reduceByKey

•  groupByKey

•  cogroup

•  cross

•  zip

•  sample
•  take
•  first
•  partitionBy
•  mapWith
•  pipe
•  save
•  ...

Page 18: Apache Spark & Hadoop


Easy: Example – Word Count

•  Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emit (word, 1) for every token in the input line.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Sum the counts for each word.
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

•  Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")


Page 20: Apache Spark & Hadoop


Easy: Works Well With Hadoop
•  Data Compatibility
–  Access your existing Hadoop data
–  Use the same data formats
–  Adheres to data locality for efficient processing
•  Deployment Models
–  “Standalone” deployment
–  YARN-based deployment
–  Mesos-based deployment
–  Deploy on existing Hadoop cluster or side-by-side

Page 21: Apache Spark & Hadoop


Easy: User-Driven Roadmap
•  Language support
–  Improved Python support
–  SparkR
–  Java 8
–  Integrated schema and SQL support in Spark’s APIs
•  Better ML
–  Sparse data support
–  Model evaluation framework
–  Performance testing

Page 22: Apache Spark & Hadoop


Example: Logistic Regression

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w

Page 23: Apache Spark & Hadoop


Fast: Logistic Regression Performance

[Chart: running time (s) vs. number of iterations (1–30), Hadoop vs. Spark]
•  Hadoop: ~110 s per iteration
•  Spark: first iteration 80 s, further iterations ~1 s

Page 24: Apache Spark & Hadoop


Easy: Multi-language Support

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

Page 25: Apache Spark & Hadoop


Easy: Interactive Shell

Scala-based shell:
% /opt/mapr/spark/spark-0.9.1/bin/spark-shell

scala> val logs = sc.textFile("hdfs:///user/keys/logdata")
scala> logs.count()
…
res0: Long = 232681

scala> logs.filter(l => l.contains("ERROR")).count()
…
res1: Long = 205

Python-based shell as well - pyspark

Page 26: Apache Spark & Hadoop


Fault Tolerance and Performance

Page 27: Apache Spark & Hadoop


Fast: Using RAM, Operator Graphs
•  In-memory caching
–  Data partitions read from RAM instead of disk
•  Operator graphs
–  Scheduling optimizations
–  Fault tolerance

[Figure: an operator graph of RDDs A–F broken into Stage 1, Stage 2 and Stage 3 by map, filter, groupBy and join operations, with cached partitions marked]

Page 28: Apache Spark & Hadoop


Directed Acyclic Graph (DAG)
•  Directed
–  Only in a single direction
•  Acyclic
–  No looping
•  This supports fault tolerance
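One way to see the DAG Spark builds is to print an RDD's lineage; a hedged Scala sketch (the exact output format varies by Spark version):

val counts = sc.textFile("hdfs://...")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)   // prints the chain of parent RDDs – the lineage used for recovery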

Page 29: Apache Spark & Hadoop


Easy: Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

[Figure: lineage chain – HDFS file → filter(func = startswith(…)) → filtered RDD → map(func = split(...)) → mapped RDD]

Page 30: Apache Spark & Hadoop


RDD Persistence / Caching
•  Variety of storage levels
–  MEMORY_ONLY (default), MEMORY_AND_DISK, etc.
•  API calls
–  persist(StorageLevel)
–  cache() – shorthand for persist(StorageLevel.MEMORY_ONLY)
•  Considerations
–  Read from disk vs. recompute (MEMORY_AND_DISK)
–  Total memory storage size (MEMORY_ONLY_SER)
–  Replicate to a second node for faster fault recovery (MEMORY_ONLY_2)
•  Think about this option if supporting a time-sensitive client

http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
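A brief hedged Scala sketch of these options (storage level names as exposed by the Spark API; paths are illustrative):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://...")
logs.cache()                                   // shorthand for persist(StorageLevel.MEMORY_ONLY)

val fields = logs.map(_.split("\t"))
fields.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions to disk instead of recomputing them
// fields.persist(StorageLevel.MEMORY_ONLY_2)  // alternative: replicate to a second node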

Page 31: Apache Spark & Hadoop


PageRank Performance

[Chart: iteration time (s) vs. number of machines, Hadoop vs. Spark]
•  Hadoop: 171 s on 30 machines, 80 s on 60 machines
•  Spark: 23 s on 30 machines, 14 s on 60 machines

Page 32: Apache Spark & Hadoop


Other Iterative Algorithms

[Chart: time per iteration (s), Hadoop vs. Spark]
•  Logistic Regression: Hadoop 110 s, Spark 0.96 s
•  K-Means Clustering: Hadoop 155 s, Spark 4.1 s

Page 33: Apache Spark & Hadoop


Fast: Scaling Down

[Chart: execution time (s) vs. % of working set in cache]
•  Cache disabled: 69 s
•  25% cached: 58 s
•  50% cached: 41 s
•  75% cached: 30 s
•  Fully cached: 12 s

Page 34: Apache Spark & Hadoop


Comparison to Storm
•  Higher throughput than Storm
–  Spark Streaming: 670k records/sec/node
–  Storm: 115k records/sec/node
–  Commercial systems: 100-500k records/sec/node

[Charts: throughput per node (MB/s) vs. record size (100 and 1000 bytes) for WordCount and Grep workloads, Spark vs. Storm]

Page 35: Apache Spark & Hadoop


How Spark Works

Page 36: Apache Spark & Hadoop


Working With RDDs

Page 37: Apache Spark & Hadoop


Working With RDDs

[Diagram: an RDD]

textFile = sc.textFile("SomeFile.txt")

Page 38: Apache Spark & Hadoop


Working With RDDs

[Diagram: a chain of RDDs produced by transformations]

textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

Page 39: Apache Spark & Hadoop


Working With RDDs

[Diagram: a chain of RDDs produced by transformations, ending in an action that returns a value]

textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()   # returns 74
linesWithSpark.first()   # returns '# Apache Spark'

Page 40: Apache Spark & Hadoop


Example: Log Mining

Page 41: Apache Spark & Hadoop


Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 42: Apache Spark & Hadoop


Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

[Diagram: a driver program and three workers]

Page 43: Apache Spark & Hadoop


Example: Log Mining

lines = spark.textFile("hdfs://...")

[Diagram: driver and three workers]

Page 44: Apache Spark & Hadoop


Example: Log Mining

lines = spark.textFile("hdfs://...")   # `lines` is the base RDD

Page 45: Apache Spark & Hadoop


Example: Log Mining

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))

Page 46: Apache Spark & Hadoop


Example: Log Mining

errors = lines.filter(lambda s: s.startswith("ERROR"))   # `errors` is a transformed RDD

Page 47: Apache Spark & Hadoop


Example: Log Mining

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

Page 48: Apache Spark & Hadoop


Example: Log Mining

messages.filter(lambda s: "mysql" in s).count()   # count() is an action

Page 49: Apache Spark & Hadoop


Example: Log Mining

[Diagram: each worker holds one HDFS block of the log file (Block 1, Block 2, Block 3)]

Page 50: Apache Spark & Hadoop


Example: Log Mining

[Diagram: the driver sends tasks to the three workers]

Page 51: Apache Spark & Hadoop


Example: Log Mining

[Diagram: each worker reads its HDFS block]

Page 52: Apache Spark & Hadoop


Example: Log Mining

[Diagram: each worker processes its block and caches the data (Cache 1, Cache 2, Cache 3)]

Page 53: Apache Spark & Hadoop


Example: Log Mining

[Diagram: the workers return results to the driver]

Page 54: Apache Spark & Hadoop


Example: Log Mining

messages.filter(lambda s: "php" in s).count()   # a second query against the cached RDD

Page 55: Apache Spark & Hadoop


Example: Log Mining

[Diagram: the driver sends tasks for the second query to the workers]

Page 56: Apache Spark & Hadoop


Example: Log Mining

[Diagram: the workers process the second query directly from their caches]

Page 57: Apache Spark & Hadoop


Example: Log Mining

[Diagram: the workers return results for the second query to the driver]

Page 58: Apache Spark & Hadoop


Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

Cache your data → faster results. Full-text search of Wikipedia:
•  60 GB on 20 EC2 machines
•  0.5 sec from cache vs. 20 s on-disk
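For reference, the same flow in Scala (a hedged sketch mirroring the Python code above; the path and tab-separated field index are carried over as assumptions):

val lines = sc.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")(2))
messages.cache()

messages.filter(_.contains("mysql")).count()   // first action: reads from HDFS, then caches messages
messages.filter(_.contains("php")).count()     // second action: served from the in-memory cache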

Page 59: Apache Spark & Hadoop


Example: Page Rank

Page 60: Apache Spark & Hadoop


Example: PageRank
•  Good example of a more complex algorithm
–  Multiple stages of map & reduce
•  Benefits from Spark’s in-memory caching
–  Multiple iterations over the same data

Page 61: Apache Spark & Hadoop


Basic Idea
Give pages ranks (scores) based on links to them
•  Links from many pages → high rank
•  Link from a high-rank page → high rank

Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

Page 62: Apache Spark & Hadoop


Algorithm
1.  Start each page at a rank of 1
2.  On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3.  Set each page’s rank to 0.15 + 0.85 × contribs

[Diagram: four pages, each starting with rank 1.0]

Page 63: Apache Spark & Hadoop


Algorithm

[Diagram: first iteration – each page (rank 1.0) sends a contribution of rank / |neighbors| (0.5 or 1) along each outgoing link]

Page 64: Apache Spark & Hadoop


Algorithm

[Diagram: ranks after the first iteration: 1.85, 1.0, 0.58, 0.58]

Page 65: Apache Spark & Hadoop


Algorithm

[Diagram: second iteration – each page again sends rank / |neighbors| contributions (e.g. 1.85, 0.58, 0.5, 0.29) along its links]

Page 66: Apache Spark & Hadoop


Algorithm

[Diagram: ranks after the second iteration: 1.72, 1.31, 0.58, 0.39 – and so on for further iterations]

Page 67: Apache Spark & Hadoop


Algorithm

[Diagram: final state – ranks converge to 1.44, 1.37, 0.73, 0.46]

Page 68: Apache Spark & Hadoop


Scala Implementation

val links =  // load RDD of (url, neighbors) pairs
var ranks =  // give each url a rank of 1.0

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    urls.map(dest => (dest, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile(...)

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala
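The slide elides how links and ranks are built; a hedged sketch in the spirit of the linked SparkPageRank example, assuming an input file with one "url neighbor" pair per line:

val lines = sc.textFile("hdfs://.../links.txt")     // hypothetical edge-list input
val links = lines.map { s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1))                              // (url, one neighbor)
}.distinct().groupByKey().cache()                   // (url, all neighbors), cached across iterations

var ranks = links.mapValues(_ => 1.0)               // start every url at rank 1.0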

Page 69: Apache Spark & Hadoop


Spark and More

Page 70: Apache Spark & Hadoop


Easy: Unified Platform

Spark (general execution engine), with libraries on top:
•  Spark SQL (SQL)
•  Spark Streaming (streaming)
•  MLlib (machine learning)
•  GraphX (graph computation)

Continued innovation bringing new functionality, e.g.:
•  BlinkDB (approximate queries)
•  SparkR (R wrapper for Spark)
•  Tachyon (off-heap RDD caching)
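As a taste of one of these libraries, a minimal hedged Spark Streaming word count in Scala (host, port and batch interval are arbitrary placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(10))      // 10-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)    // hypothetical text stream
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()                                         // print each batch's counts

ssc.start()
ssc.awaitTermination()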

Page 71: Apache Spark & Hadoop


Spark on MapR
•  Certified Spark distribution
•  Fully supported and packaged by MapR in partnership with Databricks
–  mapr-spark package with Spark, Shark, Spark Streaming today
–  Spark-python, GraphX and MLlib soon
•  YARN integration
–  Spark can then allocate resources from the cluster when needed

Page 72: Apache Spark & Hadoop


References
•  Based on slides from Pat McDonough at Databricks
•  Spark web site: http://spark.apache.org/
•  Spark on MapR:
–  http://www.mapr.com/products/apache-spark
–  http://doc.mapr.com/display/MapR/Installing+Spark+and+Shark

Page 73: Apache Spark & Hadoop


Q & A

Engage with us!
•  @mapr
•  maprtech
•  [email protected]
•  MapR
•  maprtech
•  mapr-technologies

