Takeaways
To understand:
• Why we have big data today
• What big data problems Spark solves
• How Spark approaches big data differently
But most of all… to feel comfortable trying Spark out!
(Slide: scale statistics – 7.2 B, 6.8 B, 1.44 B, 300 M, 236 M, 3.5 B / day)
All product images owned by respective companies/institutions
Hadoop’s Limitations
Hadoop lacks the one thing needed to succeed at:
• Iterative queries
• Interactive queries
That one thing: fast data sharing.
The Solution…
• Resilient Distributed Datasets (RDDs) – a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
2002 – MapReduce @ Google
2004 – MapReduce Paper
2006 – Hadoop @ Yahoo
2009 – Spark at UC Berkeley
2011 – Hadoop Released
2013 – Spark @ Apache
2014 – Spark 1.0 Released
Why Spark?
• Fast
• General Purpose
• Easy
• Streaming
• Adoption

Image Credits:
http://upload.wikimedia.org/wikipedia/commons/9/92/Easy_button.JPG
http://pixabay.com/en/faucet-water-bad-sanitaryblock-686958/
Spark Use Cases
• ETL
• Machine Learning
• Analytics
• Modeling
• Data Mining

Table Credit: http://www.wsj.com/articles/SB10001424052970203914304576630742911364206
Creating RDDs
• From practically any data source – HDFS, local file system, S3, NoSQL (Cassandra, HBase, …), JDBC
• From any collection
• By transforming an existing RDD
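The three ways above can be sketched with the standard SparkContext API. This is a sketch only: it assumes a running Spark cluster (or local mode) and an existing SparkContext `sc`; the path is a placeholder.

```scala
// Sketch -- assumes an existing SparkContext `sc`.

// From a data source (HDFS here; the path is a placeholder):
val fromFile = sc.textFile("hdfs://...")

// From any Scala collection:
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))

// By transforming an existing RDD:
val transformed = fromCollection.map(_ * 2)
```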
Read File → Split Words → Count Words
Text File → File RDD → Word RDD → Word Count RDD

Transformations chain operations together. Nothing is actually computed yet…
Read File → Split Words → Count Words → Store Result
Text File → File RDD → Word RDD → Word Count RDD → All Word Counts

Actions compute results. Why is laziness good?
Read File → Split Words → Count Words → Store Result
Text File → File RDD → Word RDD → Word Count RDD → All Word Counts → Top 10 Words

Only compute what we need. Laziness allows you to:
• Focus more on the algorithm
• Worry less about performance
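The "nothing computed until an action" behaviour can be demonstrated without a cluster using a lazy Scala view as a stand-in for an RDD (an analogy, not the Spark API itself): the `map` records work, and only materialising results triggers evaluation, for exactly as many elements as needed.

```scala
// Lazy-view analogue of RDD transformations vs. actions.
var evaluated = 0
val doubled = (1 to 100).view.map { n => evaluated += 1; n * 2 }

// "Transformation" stage: the map above only recorded the operation.
val before = evaluated               // still 0 -- nothing computed yet

// "Action" stage: materialise just the first three results.
val firstThree = doubled.take(3).toList

val after = evaluated                // 3 -- only what we needed
```

This is why laziness pays off: requesting the top of a huge pipeline never evaluates the rest of it.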
Read File → Split Words → Count Words → Store Result
Text File → File RDD → Word RDD → Word Count RDD → All Word Counts → Top 10 Words
Word RDD → “A” Word RDD (words starting with “A”)

By default, RDDs are recomputed on each use.
Read File → Split Words → Count Words → Store Result
Text File → File RDD → Word RDD → Word Count RDD → All Word Counts → Top 10 Words
Word RDD → “A” Word RDD (words starting with “A”)

For better performance… persist reused RDDs.
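The recompute-vs-persist trade-off can be mimicked in plain Scala (again an analogy, not the Spark `persist()` call itself): a `def` recomputes on every use the way an unpersisted RDD does, while a `lazy val` computes once and caches the result.

```scala
// Counting how often the "word RDD" is actually computed.
var computations = 0
def splitWords(): Seq[String] = {
  computations += 1
  "spark makes big data simple".split(" ").toSeq
}

// Without persistence: two uses -> two computations.
val use1 = splitWords()
val use2 = splitWords()
val withoutPersist = computations   // 2

// lazy val plays the role of persist(): compute once, reuse afterwards.
lazy val persisted = splitWords()
val use3 = persisted
val use4 = persisted
val withPersist = computations      // 3 -- only one extra computation for both uses
```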
Read File → Split Words → Count Words → Store Result
Text File → File RDD → Word RDD → Word Count RDD → All Word Counts → Top 10 Words
Word RDD → “A” Word RDD (words starting with “A”)

RDDs are fault tolerant. Lineage allows recreation.
Word Count Example

val input = sc.textFile("hdfs://...")                 // HadoopRDD

// Transformation
val words = input.flatMap(line => line.split(" "))    // FlatMappedRDD

// Transformation
val result = words
  .map(word => (word, 1))
  .reduceByKey((acc, curr) => acc + curr)

// Action
val collectedResult = result.collect()
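The same flatMap/map/count logic can be checked locally with plain Scala collections, no cluster needed; `groupBy` plus `size` stands in for `reduceByKey((acc, curr) => acc + curr)`.

```scala
// Local, collection-based mirror of the word-count pipeline above.
val lines = Seq("to be or not", "to be")

// Split Words (flatMap is the same operation as on the RDD):
val words = lines.flatMap(line => line.split(" "))

// Count Words (groupBy + size replaces reduceByKey):
val wordCounts: Map[String, Int] =
  words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
```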
The SparkContext:
• Connects our program (“main”) to Spark
• Creates RDDs
• Executes code on the cluster

Image courtesy of https://spark.apache.org
More Information on Spark
• https://spark.apache.org/docs/latest/index.html
• http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf
• http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
• http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf
• http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia
• http://www.meetup.com/Washington-DC-Area-Spark-Interactive/
• https://spark-summit.org/