1 © Cloudera, Inc. All rights reserved.
Introduction to Apache Spark
Jordan Volz, Systems Engineer @ Cloudera
Analyzing Data on Large Data Sets
• Python, R, etc. are popular tools among data scientists, analysts, statisticians, etc.
• Why are these tools popular?
  • Easy to learn; they maximize productivity for data engineers, data scientists, and statisticians
  • Build robust software and do interactive data analysis
  • Large, diverse open source development communities
  • Comprehensive libraries: data wrangling, ML, visualization, etc.
• Limitations do exist:
  • Largely confined to single-node analysis and smaller data sets
  • Require sampling or aggregations for larger data
  • Distributed tools compromise in various ways, adding complexity and time
  • Restricted effectiveness in certain use cases
MapReduce – Analysis on Large Data Sets (Hadoop)

Key advances by MapReduce:
• Data Locality: automatic split computation and launch of mappers appropriately
• Fault Tolerance: writing out intermediate results and restartable mappers made it possible to run on commodity hardware
• Linear Scalability: the combination of locality and a programming model that forces developers to write generally scalable solutions to problems

[Diagram: many parallel Map tasks feeding a smaller number of Reduce tasks]
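The split/map/shuffle/reduce flow in the diagram above can be sketched as a toy word count in plain Python. This is illustrative only, not Hadoop code; the `mapper`, `shuffle`, and `reducer` helpers are made-up stand-ins for the framework's phases.

```python
from collections import defaultdict

# Each "mapper" emits (word, 1) pairs from one input split; the shuffle
# groups pairs by key; each "reducer" sums the counts for a single key.

def mapper(line):
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, vs) for k, vs in shuffle(mapped).items())
```

Because each mapper only sees its own split and each reducer only sees one key's values, the same program scales out by adding machines, which is the "forces generally scalable solutions" point above.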
MapReduce is Not Perfect
• Limited to the map-reduce paradigm
• Lots of I/O → slower jobs
• Iterative jobs (ML) → even slower
• Redundant joins with SQL tools

[Diagram: chains of Map and Reduce stages, each stage writing its output to disk]
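The I/O cost of chaining stages can be made concrete with a toy illustration, not real Hadoop or Spark code: an iterative job that, like chained MapReduce stages, writes its intermediate result to disk and re-reads it on every pass, versus the same job keeping data in memory. The temp file and the `+ 1` "computation" are made-up stand-ins.

```python
import json
import os
import tempfile

def iterate_via_disk(data, iterations):
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        for _ in range(iterations):
            with open(path, "w") as f:    # stage output: write to disk
                json.dump(data, f)
            with open(path) as f:         # next stage input: read it back
                data = json.load(f)
            data = [x + 1 for x in data]  # the actual per-pass computation
    finally:
        os.remove(path)
    return data

def iterate_in_memory(data, iterations):
    for _ in range(iterations):
        data = [x + 1 for x in data]      # same computation, no disk I/O
    return data
```

Both produce identical results; the disk version pays a serialize/write/read/deserialize tax on every iteration, which is exactly what hurts iterative ML workloads on MapReduce.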
MapReduce on YARN
Death by Pinprick
Apache Spark
Flexible, in-memory data processing for Hadoop
Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell

Flexible, Extensible API
• APIs for different types of workloads:
  • Batch (MR)
  • Streaming
  • Machine Learning
  • Graph

Fast Batch & Stream Processing
• In-memory processing and caching

Retains: Linear Scalability, Fault Tolerance, Data Locality
Spark Basics
• Distributed cluster framework (like MR), running tasks in parallel across a cluster
• Tasks operate in-memory and spill to disk when memory is exceeded
• Resilient Distributed Datasets (RDDs): read-only partitioned collections of records
• RDDs are operated on through parallel transformations and actions
• Lazy materialization optimizes resource use
• RDD lineage from the storage layer through compute and caching provides fault tolerance
• Users control persistence and partitioning
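Lazy materialization is the point that trips people up most, so here is a toy sketch in plain Python (not Spark): chaining transformations only builds a recipe; no data moves until an action runs. The generator pipeline stands in for an RDD's lineage.

```python
log = []  # records when each element is actually processed

def source():
    # stands in for reading partitions from storage
    for x in range(5):
        log.append(("read", x))
        yield x

def square(items):
    # stands in for a map() transformation
    for x in items:
        log.append(("square", x))
        yield x * x

pipeline = square(source())   # "transformations": nothing has executed yet
assert log == []              # lazy: no records read so far
result = sum(pipeline)        # the "action" pulls data through the lineage
```

The same property enables fault tolerance: because the pipeline is a recorded recipe, a lost partition can be recomputed from its lineage rather than restored from a replica.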
Fast Processing Using RAM, Operator Graphs

In-Memory Caching
• Data partitions read from RAM instead of disk

Operator Graphs
• Scheduling optimizations
• Fault tolerance

[Diagram: a DAG of RDDs (A-F) linked by map, join, filter, and groupBy transformations and ending in a take action; cached partitions are marked]
Logistic Regression Performance (Data Fits in Memory)

[Chart: running time in seconds (0-4000) vs. number of iterations (1, 5, 10, 20, 30), comparing MapReduce and Spark. MapReduce: ~110 s per iteration. Spark: first iteration 80 s; further iterations ~1 s due to caching.]
Spark on YARN
Spark will replace MapReduce
To become the standard execution engine for Hadoop
Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
The Future of Data Processing on Hadoop
Spark complemented by specialized, fit-for-purpose engines

• General Data Processing w/ Spark: fast batch processing, machine learning, and stream processing
• Analytic Database w/ Impala: low-latency, massively concurrent queries
• Full-Text Search w/ Solr: querying textual data
• On-Disk Processing w/ MapReduce: jobs at extreme scale that are extremely disk-I/O intensive

Shared: data storage, metadata, resource management, administration, security, governance
Easy Development: High Productivity Language Support

• Native support for multiple languages with identical APIs: Scala, Java, Python
• Use of closures, iterations, and other common language constructs to minimize code
• 2-5x less code

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
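The three snippets above express the same ERROR-counting query; as a point of comparison, here is a plain-Python stand-in with a list in place of an RDD (no Spark needed), showing how little code the closure-based style requires. The sample log lines are made up.

```python
# A local stand-in for sc.textFile(...): a list of log lines.
lines = ["INFO start", "ERROR disk full", "INFO ok", "ERROR timeout"]

# Equivalent of lines.filter(lambda s: "ERROR" in s).count()
error_count = sum(1 for s in lines if "ERROR" in s)
```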
Easy Development: Use Interactively

• Interactive exploration of data for data scientists
• No need to develop "applications"
• Developers can prototype applications on a live system

percolateur:spark srowen$ ./bin/spark-shell --master local[*]
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
...

scala> val words = sc.textFile("file:/usr/share/dict/words")
...
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> words.count
...
res0: Long = 235886

scala>
Easy Development: Expressive API

Transformations and actions include:
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, sample, take, first, partitionBy, mapWith, pipe, save, ..., reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
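To make a few of the pair operators in the list above concrete, here are hypothetical plain-Python models of reduceByKey, groupByKey, and join, showing their semantics on (key, value) pairs. This is illustrative only, not Spark's implementation.

```python
from collections import defaultdict

def reduce_by_key(pairs, fn):
    # reduceByKey: combine all values sharing a key with a binary function
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return acc

def group_by_key(pairs):
    # groupByKey: collect all values sharing a key into a list
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def join(left, right):
    # join: inner join on key, pairing every left value with every right value
    right_groups = group_by_key(right)
    return [(k, (lv, rv)) for k, lv in left for rv in right_groups.get(k, [])]

pairs = [("a", 1), ("b", 2), ("a", 3)]
totals = reduce_by_key(pairs, lambda x, y: x + y)
groups = group_by_key(pairs)
joined = join([("a", 1), ("c", 7)], [("a", 9)])
```

In Spark these run partition-by-partition across the cluster, with a shuffle moving values for the same key to the same machine; the per-key semantics are as above.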
Example: Logistic Regression

data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient
print("Final w: %s" % w)
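The snippet above depends on an RDD and a readPoint helper, so it cannot be run standalone. Here is a self-contained NumPy version of the same gradient-descent loop on synthetic data; the data, dimensionality, step size, and iteration count are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
X = rng.normal(size=(200, D))                 # feature vectors (p.x)
w_true = np.array([1.0, -2.0, 0.5])
y = np.where(X @ w_true > 0, 1.0, -1.0)       # labels (p.y) in {-1, +1}

def loss(w):
    # average logistic loss over the data set
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

w = rng.random(D)
initial_loss = loss(w)
for _ in range(100):
    s = 1.0 / (1.0 + np.exp(-y * (X @ w)))    # per-point sigmoid term
    # gradient of the average logistic loss: mean of (s - 1) * y * x
    gradient = np.mean(((s - 1.0) * y)[:, None] * X, axis=0)
    w -= 0.5 * gradient                       # small gradient-descent step
final_loss = loss(w)
accuracy = np.mean(np.sign(X @ w) == y)
```

In the Spark version the map over points and the reduce that sums their gradient contributions replace the vectorized mean here, and cache() is what makes iterations after the first fast.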
The Spark Ecosystem & Hadoop

Spark libraries: Spark Streaming, MLlib, Spark SQL, GraphX, DataFrames, SparkR
Execution engines: Spark, Impala, MR, Search, others
Resource management: YARN
Storage: HDFS, HBase
One Platform, Many Workloads

Batch, Interactive, and Real-Time. Leading performance and usability in one platform.

• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users

Process: Ingest (Sqoop, Flume, Kafka, Spark Streaming); Transform (MapReduce, Hive, Pig, Spark)
Discover: Analytic Database (Impala); Search (Solr)
Model: Machine Learning (SAS, R, Spark, Mahout)
Serve: NoSQL Database (HBase); Streaming (Spark Streaming)
Unlimited Storage: HDFS, HBase
Security and Administration: YARN, Cloudera Manager, Cloudera Navigator
Cloudera Customer Use Cases

Core Spark:
• Financial Services: portfolio risk analysis; ETL pipeline speed-up; 20+ years of stock data
• Health: identify disease-causing genes in the full human genome; calculate Jaccard scores on health care data sets
• ERP: optical character recognition and bill classification
• Data Services: trend analysis; document classification (LDA); fraud analytics

Spark Streaming:
• Financial Services: online fraud detection
• Ad Tech: real-time ad performance analysis

Over 150 customers using Spark; Spark clusters as large as 800 nodes
Uniting Spark and Hadoop: The One Platform Initiative
Investment Areas

• Management: leverage Hadoop-native resource management
• Security: full support for Hadoop security and beyond
• Scale: enable 10k-node clusters
• Streaming: support for 80% of common stream processing workloads
Spark Resources
• Learn Spark
  • O'Reilly Advanced Analytics with Spark eBook (written by Clouderans)
  • Cloudera Developer Blog
  • cloudera.com/spark
• Get Trained
  • Cloudera Spark Training
• Try It Out
  • Cloudera Live Spark Tutorial
Thank You

[email protected]
linkedin.com/in/jordan.volz