Introduc|on to Apache Spark

1 © Cloudera, Inc. All rights reserved.

Introduc8on to Apache Spark Jordan Volz, Systems Engineer @ Cloudera


Analyzing Data on Large Data Sets • Python, R, etc. are popular tools among data scien8sts/analysts, sta8s8cians, etc. • Why are these tools popular? •  Easy to learn and maximizes produc8vity for data engineers, data scien8sts, sta8s8cians •  Build robust soOware and do interac8ve data analysis •  Large, diverse open source development communi8es •  Comprehensive libraries: data wrangling, ML, visualiza8on, etc.

•  Limita8ons do exist: •  Largely confined to single-‐node analysis and smaller data sets •  Requires sampling or aggrega8ons for larger data •  Distributed tools compromise in various ways – adds complexity and 8me •  Restricts effec8veness in certain use cases


Key Advances by MapReduce:

•  Data Locality: Automa8c split computa8on and launch of mappers appropriately

•  Fault-‐Tolerance: Write out of intermediate results and restartable mappers meant ability to run on commodity hardware

•  Linear Scalability: Combina8on of locality + programming model that forces developers to write generally scalable solu8ons to problems

MapReduce – Analysis on Large Data Sets (Hadoop)

Map Map Map Map Map Map Map Map Map Map Map Map

Reduce Reduce Reduce Reduce


Map Reduce is Not Perfect Map

Reduce Map

Map Reduce

Map

Limited to map-‐reduce paradigm

Lots of I/O à slower jobs Map Reduce Map Reduce

Map Reduce

Itera8ve jobs (ML) à even slower

Redundant joins with SQL Tools


MapReduce on YARN


Death by Pinprick


Apache Spark Flexible, in-‐memory data processing for Hadoop

Easy Development

Flexible Extensible API

Fast Batch & Stream Processing

•  Rich APIs for Scala, Java, and Python

•  Interac8ve shell

•  APIs for different types of workloads: •  Batch (MR) •  Streaming •  Machine Learning •  Graph

•  In-‐Memory processing and caching

Retains: Linear Scalability, Fault-‐Tolerance, Data Locality


Spark Basics • Distributed cluster framework (like MR), running tasks in parallel across a cluster • Tasks operate in-‐memory, spill to disk when memory exceeded. • Resilient Distributed Datasets (RDD): Read-‐only par88oned collec8on of records • RDDs ac8onable through parallel transforma8ons and ac8ons • Lazy materializa8on op8mizes resources • RDD lineage from storage to compute and caching layer provides fault-‐tolerance • Users control persistence and par88oning


Fast Processing Using RAM, Operator Graphs

In-‐Memory Caching •  Data Par88ons read from RAM

instead of disk Operator Graphs •  Scheduling Op8miza8ons •  Fault Tolerance

join

filter

groupBy

B: B:

C: D: E:

F:

Ç√Ω

map

A:

map

take

= cached par88on = RDD


Logis8c Regression Performance (Data Fits in Memory)

0

500

1000

1500

2000

2500

3000

3500

4000

1 5 10 20 30

Runn

ing Time(s)

# of IteraMons

MapReduce

Spark

110 s/itera8on

First itera8on = 80s Further itera8ons 1s due to caching


Spark on YARN


Spark will replace MapReduce To become the standard execu8on engine for Hadoop

Spark public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } } public static class WorkdCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } }

Hadoop MapReduce val spark = new SparkContext(master, appName, [sparkHome], [jars]) val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")


The Future of Data Processing on Hadoop Spark complemented by specialized fit-‐for-‐purpose engines

General Data Processing w/Spark

Fast Batch Processing, Machine Learning, and Stream Processing

AnalyMc Database w/

Impala Low-‐Latency

Massively Concurrent Queries

Full-‐Text Search w/Solr Querying textual data

On-‐Disk Processing w/MapReduce Jobs at extreme scale and extremely disk IO intensive

Shared: •  Data Storage •  Metadata •  Resource

Management •  Administra8on •  Security •  Governance


Easy Development High Produc8vity Language Support

• Na8ve support for mul8ple languages with iden8cal APIs • Scala, Java, Python

• Use of closures, itera8ons, and other common language constructs to minimize code • 2-‐5x less code

Python lines = sc.textFile(...) lines.filter(lambda s: “ERROR” in s).count()

Scala val lines = sc.textFile(...) lines.filter(s => s.contains(“ERROR”)).count()

Java JavaRDD<String> lines = sc.textFile(...); lines.filter(new Function<String, Boolean>() { Boolean call(String s) { return s.contains(“error”); } }).count();


Easy Development Use Interac8vely

•  Interac8ve explora8on of data for data scien8sts • No need to develop “applica8ons”

• Developers can prototype applica8on on live system

percolateur:spark srowen$ ./bin/spark-shell --master local[*]...Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.0-SNAPSHOT /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)Type in expressions to have them evaluated.Type :help for more information....

scala> val words = sc.textFile("file:/usr/share/dict/words")...words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> words.count...res0: Long = 235886

scala>


Easy Development Expressive API •  map

•  filter

•  groupBy

•  sort

•  union

•  join

•  leftOuterJoin

•  rightOuterJoin

•  sample

•  take

•  first

•  partitionBy

•  mapWith

•  pipe

•  save

•  …

•  reduce

•  count

•  fold

•  reduceByKey

•  groupByKey

•  cogroup

•  cross

•  zip


Example Logis8c Regression

data = spark.textFile(...).map(readPoint).cache() w = numpy.random.rand(D) for i in range(iterations): gradient = data .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) .reduce(lambda x, y: x + y) w -= gradient print “Final w: %s” % w


The Spark Ecosystem & Hadoop

Spark Streaming MLlib SparkSQL GraphX Data-‐

frames SparkR

STORAGE HDFS, HBase

RESOURCE MANAGEMENT YARN

Spark Impala MR Others Search


One Plauorm, Many Workloads

Batch, Interac8ve, and Real-‐Time. Leading performance and usability in one plauorm.

•  End-‐to-‐end analy8c workflows •  Access more data •  Work with data in new ways •  Enable new users

Security and Administra8on

Process Ingest

Sqoop, Flume, Kaxa, Spark Streaming Transform MapReduce,

Hive, Pig, Spark

Discover Analy8c Database

Impala

Search Solr

Model Machine Learning SAS, R, Spark,

Mahout

Serve NoSQL Database

HBase

Streaming Spark Streaming

Unlimited Storage HDFS, HBase

YARN, Cloudera Manager, Cloudera Navigator


Cloudera Customer Use Cases

Core Spark

Spark Streaming

•  Poruolio Risk Analysis •  ETL Pipeline Speed-‐Up •  20+ years of stock data Financial

Services

Health

•  Iden8fy disease-‐causing genes in the full human genome

•  Calculate Jaccard scores on health care data sets

ERP

•  Op8cal Character Recogni8on and Bill Classifica8on

•  Trend analysis •  Document classifica8on (LDA) •  Fraud analy8cs Data

Services

1010

•  Online Fraud Detec8on Financial Services

Ad Tech

•  Real-‐Time Ad Performance Analysis

Over 150 customers using Spark Spark clusters as large as 800 nodes


Uni8ng Spark and Hadoop The One Plauorm Ini8a8ve Investment Areas

Management Leverage Hadoop-‐na8ve resource management.

Security Full support for Hadoop security

and beyond.

Scale Enable 10k-‐node clusters.

Streaming Support for 80% of common stream

processing workloads.


Spark Resources •  Learn Spark • O’Reilly Advanced Analy8cs with Spark eBook (wri{en by Clouderans) • Cloudera Developer Blog • cloudera.com/spark

• Get Trained • Cloudera Spark Training

• Try it Out • Cloudera Live Spark Tutorial


Thank You [email protected] linkedin.com/in/jordan.volz

Documents

Introduc|on to Apache Spark