Up and Running with Spark
Achilles' Heel of Hadoop
Hadoop is not fast enough, *apparently*, for things like ML.
Need to read from disk again after each MR job.
{ MR1 => HDFS => MR2 => HDFS => MR3 }
MapReduce, let's admit, is a bit too complicated.
The problem with a giant codebase:
{Hadoop: 1.7 million LOC} {Spark: 0.35 million LOC}
Why Spark?
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
A Brief History of Spark
Timeline
UC Berkeley: the home of innovation.
2009: Started as a simple class project.
The UCB folks wanted to create a cluster management system: Mesos.
They needed something to test on top of Mesos. Voilà: Spark.
2010: Open-sourced under a BSD licence.
Feb 2014: Became an Apache Top-Level Project.
Nov 2014: New world record in large-scale sorting.
https://soundcloud.com/oreilly-radar/apache-sparks-journey-from-academia-to-industry
Spark made wise choices
The Spark Stack
Spark Concepts
In-memory Processing
{Processors: 64-bit ~~ up to 1 TB RAM}
{Fact: RAM will always be faster than disk}
{Idea: Compress data, do processing}
{Remember: Data is distributed across various machines too}
Resilient Distributed Datasets
http://www.gridgain.com/in-memory-computing-in-plain-english/
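A hedged sketch of what keeping data in memory looks like in PySpark: cache() marks an RDD to be stored in RAM once it is first computed, so later operations reuse it instead of re-reading from disk. The file name is illustrative, reused from the slides that follow.

lines = sc.textFile("shakespeare.txt")               # illustrative file name
lines.cache()                                        # mark for in-memory storage
print(lines.count())                                 # first action: computes, then caches
print(lines.filter(lambda l: "king" in l).count())   # reuses the cached data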
Resilient: This is Sparta, and we don't give up on data without a fight.
Distributed: A part of the data is everywhere.
Dataset: Meh!
A bit more on RDDs
The basic unit of data in Spark.
RDDs are immutable // think Java's final int b = 0; as opposed to int a = 0;
There are two main categories of operations on RDDs:
a) Transformations => Lazily evaluated. => Create a new RDD from an existing RDD.
b) Actions => Return a value or write to disk.
E.g.: My mom asks me to buy grocery items: the list can keep growing (transformations), but nothing actually happens until I go to the shop (the action). See the sketch below.
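A minimal PySpark sketch of that analogy (the grocery list is made up; map is the transformation, collect() the action):

groceries = sc.parallelize(["milk", "eggs", "bread", "rice"])  # hypothetical list
wish_list = groceries.map(lambda item: item.upper())  # transformation: recorded, not run
print(wish_list.collect())  # action: now Spark computes -> ['MILK', 'EGGS', 'BREAD', 'RICE']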
Setting Up
Download "Prebuilt for Hadoop 2.4 and later"
Build from source with Maven or sbt.
./bin/pyspark
http://spark.apache.org/downloads.html
Talk is Cheap! Show Me the Code
The PySpark shell is a REPL.
Creating an RDD
a) From data in memory.
b) From a file.
c) From another RDD (see the sketch after this block).

rdd = sc.parallelize("ChennaiPy")  # from a string (each character becomes an element)
nums = [1, 2, 3]
rdd_nums = sc.parallelize(nums)  # from a list
rdd_shakespeare = sc.textFile("shakespeare.txt")  # from a file
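For (c), any transformation yields a new RDD from an existing one. A one-line sketch continuing the session above:

rdd_upper = rdd_shakespeare.map(lambda line: line.upper())  # an RDD from another RDD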
Transformations
Less dramatic than this, but beautiful nevertheless.
Classic Example 1: Map
a) Beauty in this case comes from lambda expressions.

nums = [1, 2, 3, 4, 5, 6]
rdd_nums = sc.parallelize(nums)  # creating our RDD
new_rdd = rdd_nums.map(lambda x: x**2)  # you've got squares
print(new_rdd.collect())  # finally, some action
Did you say 80 Operations?
http://nbviewer.ipython.org/github/jkthompson/pyspark-pictures/blob/master/pyspark-pictures.ipynb
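A quick, hedged taste of a few of those operations (all standard RDD methods; the data is made up):

nums = sc.parallelize([1, 2, 3, 4, 5, 6])
print(nums.filter(lambda x: x % 2 == 0).collect())   # transformation: keep evens -> [2, 4, 6]
print(nums.reduce(lambda a, b: a + b))               # action: sum -> 21
words = sc.parallelize(["to be", "or not"])
print(words.flatMap(lambda s: s.split()).collect())  # -> ['to', 'be', 'or', 'not']
print(nums.take(3))                                  # action: first three -> [1, 2, 3]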
Use Case: Log Analysis
Demo Time
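The demo itself isn't captured in the slides; here is a minimal sketch of the kind of log analysis it might have shown. The file name and log format (lines containing "ERROR" followed by a message) are assumptions.

logs = sc.textFile("server.log")                     # assumed file name
errors = logs.filter(lambda line: "ERROR" in line)   # keep only error lines
errors.cache()                                       # we hit this RDD twice
print(errors.count())                                # how many errors?
top = (errors.map(lambda line: line.split("ERROR", 1)[1].strip())
             .map(lambda msg: (msg, 1))
             .reduceByKey(lambda a, b: a + b)
             .takeOrdered(5, key=lambda kv: -kv[1])) # five most frequent messages
print(top)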
Thank You