Up and Running with PySpark



Page 1: Up and running with pyspark

Up and Running with PySpark

Page 2: Up and running with pyspark

The Achilles' heel of Hadoop

Hadoop is not fast enough, *apparently*, for things like ML.

Data needs to be read from disk again after each MR job:

{ MR1 => HDFS => MR2 => HDFS => MR3 }

MR, let's admit it, is a bit too complicated.

And then there is the problem of a giant codebase:

{Hadoop: 1.7 million LOC} {Spark: 0.35 million LOC}
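For contrast, a minimal PySpark sketch of a chained pipeline (file names and logic are hypothetical): intermediate results flow through memory, and disk is touched only at the two ends instead of between every stage.

data = sc.textFile("input.txt")              # read from disk once
step1 = data.map(lambda line: line.strip())  # stage 1: in memory
step2 = step1.filter(lambda line: line)      # stage 2: drop empty lines, in memory
step2.saveAsTextFile("output")               # write to disk only at the end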

Page 3: Up and running with pyspark

Why Spark?

https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

Page 4: Up and running with pyspark

A brief history of Spark: a timeline

UC Berkeley: the home of innovation.

2009: Started as a simple class project.

The UC Berkeley folks wanted to build a cluster management system: Mesos.

They needed something to test on top of Mesos. Voilà: Spark.

2010: Open-sourced under the BSD license.

Feb 2014: Became an Apache Top-Level Project.

Nov 2014: Set a new world record in large-scale sorting.

https://soundcloud.com/oreilly-radar/apache-sparks-journey-from-academia-to-industry

Page 5: Up and running with pyspark

Spark made wise choices

Page 6: Up and running with pyspark

The Spark Stack

Page 7: Up and running with pyspark

Spark Concepts: In-Memory Processing

{Processors: 64-bit ~~ up to 1 TB RAM}
{Fact: RAM will always be faster than disk}
{Idea: compress data, process it in memory}
{Remember: data is distributed across various machines too}

Resilient Distributed Datasets

http://www.gridgain.com/in-memory-computing-in-plain-english/

Resilient: This is Sparta, and we don't give up on data without a fight (lost pieces of an RDD can be recomputed).

Distributed: a part of the data lives on every machine.

Dataset: Meh!
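In-memory processing in practice, as a small sketch (file name and outputs are illustrative): caching an RDD keeps it in RAM, so repeated actions skip the disk.

rdd = sc.textFile("shakespeare.txt")
rdd.cache()          # mark the RDD for in-memory storage
print(rdd.count())   # first action: reads from disk, then caches
print(rdd.count())   # second action: served from RAM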

Page 8: Up and running with pyspark

A bit more on RDDs

The RDD is the basic unit of data in Spark.

RDDs are immutable (think final int b = 0; in Java, rather than int a = 0;).

There are two main categories of operations on an RDD:

a) Transformations => lazily evaluated => create a new RDD from an existing RDD.

b) Actions => return values => or write to disk.

E.g.: my mom asks me to buy grocery items. Adding items to the list is a transformation; actually going to the shop is the action.
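A minimal sketch of that laziness (values are illustrative): the transformation returns instantly without touching the data; computation happens only when an action runs.

nums = sc.parallelize([1, 2, 3, 4, 5, 6])
doubled = nums.map(lambda x: x * 2)  # transformation: lazy, nothing computed yet
print(doubled.count())               # action: computation runs here -> 6
print(doubled.collect())             # action: [2, 4, 6, 8, 10, 12]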

Page 9: Up and running with pyspark

Setting Up

Download “Prebuilt for Hadoop 2.4 and later”.

Build from source with Maven or sbt.

./bin/pyspark

http://spark.apache.org/downloads.html
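Once ./bin/pyspark starts, a SparkContext is already available as sc; a quick sanity check (outputs shown are illustrative):

print(sc.version)  # e.g. "1.3.0", whichever release you downloaded
print(sc.master)   # e.g. "local[*]" when running on one machine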

Page 10: Up and running with pyspark

Talk is cheap, show me the code!

The PySpark shell is a REPL.

Creating an RDD:
a) from data in memory
b) from a file
c) from another RDD

rdd = sc.parallelize("ChennaiPy")  # from a string
nums = [1, 2, 3]
rdd_nums = sc.parallelize(nums)  # from a list
rdd_shakespeare = sc.textFile("shakespeare.txt")  # from a file
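A few basic actions on the RDDs just created (outputs are illustrative):

print(rdd.collect())            # ['C', 'h', 'e', ...] a string parallelizes to characters
print(rdd_nums.count())         # 3
print(rdd_shakespeare.first())  # the first line of shakespeare.txt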

Page 11: Up and running with pyspark

Transformations

Less dramatic than this, but beautiful nevertheless.

Classic Example 1: Map

a) Beauty, in this case, comes from lambda expressions.

nums = [1, 2, 3, 4, 5, 6]
rdd_nums = sc.parallelize(nums)  # creating our RDD
new_rdd = rdd_nums.map(lambda x: x**2)  # you've got squares
print(new_rdd.collect())  # finally, some action

Page 12: Up and running with pyspark

Did you say 80 Operations?

http://nbviewer.ipython.org/github/jkthompson/pyspark-pictures/blob/master/pyspark-pictures.ipynb
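As a taste of those operations, a rough sketch of the classic word count on the shakespeare.txt file from earlier, chaining flatMap, map, and reduceByKey (output is illustrative):

lines = sc.textFile("shakespeare.txt")
words = lines.flatMap(lambda line: line.split())     # one element per word
pairs = words.map(lambda w: (w.lower(), 1))          # (word, 1) pairs
counts = pairs.reduceByKey(lambda a, b: a + b)       # sum the counts per word
print(counts.takeOrdered(5, key=lambda kv: -kv[1]))  # five most frequent words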

Page 13: Up and running with pyspark

Use Case: Log Analysis
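The demo itself is not reproduced here, but a minimal sketch of this kind of log analysis might look like the following (file name and log format are hypothetical):

logs = sc.textFile("access.log")                    # hypothetical log file
errors = logs.filter(lambda line: "ERROR" in line)  # keep only error lines
errors.cache()                                      # we will query these repeatedly
print(errors.count())                               # how many errors in total?
print(errors.take(3))                               # peek at the first few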

Page 14: Up and running with pyspark

Demo Time

Page 15: Up and running with pyspark

Thank You