Spark - Lightning-Fast Cluster Computing by Example Ramesh Mudunuri, Vectorum Saturday, December 6, 2014




Page 1: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example

Ramesh Mudunuri, Vectorum
Saturday, December 6, 2014

Page 2: Dec6 meetup spark presentation

About me
• Big data enthusiast
• Member of the product development team at Vectorum.com, a startup using Spark technology

Page 3: Dec6 meetup spark presentation

What to expect
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

This is not…
• A training class
• A workshop
• A product demo with commercial interest

Page 4: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 5: Dec6 meetup spark presentation

What is Spark?

General purpose large-scale high performance processing engine

http://spark.apache.org/

Page 6: Dec6 meetup spark presentation

What is Spark?

Like MapReduce, but an in-memory processing engine, and much faster

http://spark.apache.org/

Page 7: Dec6 meetup spark presentation

What is Spark?

• Apache Spark™ is a fast and general engine for large-scale data processing.

Page 8: Dec6 meetup spark presentation

Spark History

• Started as a research project in 2009 at the UC Berkeley AMPLab; open-sourced in 2010 and later became an Apache project

• Matei Zaharia, Spark development team member and Databricks co-founder

Page 9: Dec6 meetup spark presentation

Why is Spark so special?
• Speed: a general-purpose, fast, in-memory processing engine
• (Relatively) easy to develop and deploy complex analytical applications
• APIs for Java, Scala, and Python
• Well-integrated ecosystem tools

www.databricks.com

Page 10: Dec6 meetup spark presentation

Why is Spark so special…
• In-memory processing makes Spark well suited for iterative algorithm computations
• Can run in various setups:
  – Standalone (my favorite way to learn Spark)
  – Cluster, EC2
  – YARN, Mesos
• Reads data from:
  – Local file system
  – HDFS
  – HBase, Cassandra, and more

http://www.cloudera.com

Page 11: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 12: Dec6 meetup spark presentation

Apache Spark Core
• Foundation of the stack
• Scheduling
• Memory management
• Fault recovery, etc.

Page 13: Dec6 meetup spark presentation

Spark SQL
• Execute Spark jobs with SQL expressions
• Compatible with Hive*
• JDBC/ODBC connection capabilities

* Hive: distributed data-storage SQL software with custom UDF capabilities

Page 14: Dec6 meetup spark presentation

Spark Streaming
• Component to process live streams of data
• API to handle streaming data
• Example sources: log files, queued messages, sensor-emitted data

Page 15: Dec6 meetup spark presentation

MLlib - Machine Learning
• Libraries of machine-learning algorithms
• E.g.: classification, regression, clustering, collaborative filtering, dimensionality reduction
• Very active Spark development community

Page 16: Dec6 meetup spark presentation

GraphX

APIs for graph computation:
• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count

Alpha level

Page 17: Dec6 meetup spark presentation

Spark Engine Terminology
• Spark Context
  – An object Spark uses to access the cluster
• Driver & Executor
  – The driver runs the main program and launches parallel operations
  – Executors run inside workers and execute the tasks
• Resilient Distributed Dataset (RDD)
  – Immutable, fault-tolerant collection object
  – RDD functions (similar to Hadoop MapReduce functions):
    1. Transformations
    2. Actions

Page 18: Dec6 meetup spark presentation

Spark shell and Spark context

Page 19: Dec6 meetup spark presentation

Driver & Executor
The driver runs the main program and launches parallel operations.
Executors run inside workers and execute the tasks.

Page 20: Dec6 meetup spark presentation

RDD-Resilient Distributed Dataset

• Resilient Distributed Datasets (RDDs) are Spark's fundamental abstraction for representing a collection of objects that can be distributed across multiple machines in a cluster.

• Simple definition: an immutable, fault-tolerant collection object

• There are two ways to create an RDD in Spark:
  1. Create an RDD from an external data source
  2. Perform a transformation on one or more existing RDDs

  val lines = sc.textFile("/filepath/README.md")
  val errors = lines.filter(_.startsWith("ERROR"))

Page 21: Dec6 meetup spark presentation

RDD

• There are two ways to create an RDD in Spark:

1. Create an RDD from an external data source:
   val lines = sc.textFile("/filepath/README.md")

2. Perform a transformation on one or more existing RDDs:
   val errors = lines.filter(_.startsWith("ERROR"))

Page 22: Dec6 meetup spark presentation

Transformation - Action

• Transformation operations are lazy (they are not executed immediately)
• Transformations create new RDDs from existing RDDs, e.g. filter, map
• Action operations return final values to the driver program or write data to the file system, e.g. collect, saveAsTextFile

http://www.mapr.com
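A minimal spark-shell sketch of this lazy behavior, reusing the README path from the RDD slides (the `sc` context is provided by the shell; the output path is illustrative):

```scala
// Transformations are lazy: these lines only record the lineage, no job runs yet
val lines  = sc.textFile("/filepath/README.md")
val errors = lines.filter(_.startsWith("ERROR"))
val counts = errors.map(_.length)             // still nothing has executed

// Actions trigger the actual computation
val n = errors.count()                        // returns a value to the driver
errors.saveAsTextFile("/filepath/errors-out") // writes data to the file system
```

Until `count()` or `saveAsTextFile()` is called, Spark only builds the execution plan; this is what lets it bundle transformations together and optimize the whole pipeline.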

Page 23: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 24: Dec6 meetup spark presentation

How is Spark different from Hadoop MapReduce?

1. Speed
   • Spark: 100x faster in-memory, 10x faster on disk

2. Ease of use
   • Spark: easily write applications in Java, Scala, or Python; interactive shell available for Scala and Python; high-level, simple map-reduce operations
   • Hadoop: Java only; no shell; complex map-reduce operations

3. Tools
   • Spark: well-integrated tools (Spark SQL, Streaming, MLlib, etc.) for developing complex analytical applications
   • Hadoop: loosely coupled large set of tools, but very mature

4. Deployment
   • Spark: Hadoop V1/V2 (YARN), and also Mesos and Amazon EC2

5. Data sources
   • Spark: HDFS (Hadoop), HBase, Cassandra, Amazon S3

Page 25: Dec6 meetup spark presentation

How is Spark different from Hadoop MapReduce? (continued)

6. Applications
   • Spark: an 'application' is the higher-level unit; it runs multiple jobs in sequence or in parallel, and its processes, called executors, run on the cluster workers
   • Hadoop: a 'job' is the higher-level unit; it processes data with map-reduce and writes the result to storage

7. Executors
   • Spark: an executor can run multiple tasks in a single process
   • Hadoop: each map/reduce task runs in its own process

8. Shared variables
   • Spark: broadcast variables are read-only (lookup) variables shipped only once to each worker; accumulators let workers add values while only the driver reads the result, and they are fault tolerant
   • Hadoop: counters, plus additional (system) metric counters such as 'Map input records'

9. Persisting/caching RDDs
   • Spark: cached RDDs can be reused across operations, which increases processing speed

10. Lazy evaluation
   • Spark: transformation execution plans are bundled together and execute only when an RDD action is invoked
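Rows 8 and 9 can be sketched in the spark-shell (assuming the shell-provided `sc`; the lookup table and sample data are made-up illustrations):

```scala
// Broadcast variable: a read-only lookup table shipped once to each worker
val codes = sc.broadcast(Map("NY" -> "New York", "CA" -> "California"))

// Accumulator: workers add to it, only the driver reads the final value
val badRecords = sc.accumulator(0)

val states = sc.parallelize(Seq("NY", "CA", "TX"))
val names = states.map { s =>
  if (!codes.value.contains(s)) badRecords += 1   // count lookup misses on the workers
  codes.value.getOrElse(s, "unknown")
}

names.cache()             // keep the RDD in memory for reuse across actions
names.count()             // first action computes and caches the RDD
names.collect()           // second action reuses the cached data
println(badRecords.value) // driver reads the accumulator result
```

Without the broadcast, the lookup map would be re-shipped with every task; without `cache()`, the second action would recompute the whole lineage from scratch.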

Page 26: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 27: Dec6 meetup spark presentation

Where Spark shines
• Well suited for any iterative computation
  – Machine-learning algorithms
  – Iterative analytics
• Multi-data-source computations
  – Multi-sourced sensor data
• Aggregated analytics
  – Transforming and summarizing the data

Page 28: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 29: Dec6 meetup spark presentation

• Link: http://spark.apache.org/downloads.html
• Standalone - choose a package type: pre-built for Hadoop 1.x
• Source code is also available; build tools: Maven or sbt
• Distro versions: Hadoop, Cloudera, MapR

Page 30: Dec6 meetup spark presentation

Current Spark version

Release cycle: every 3 months

Page 31: Dec6 meetup spark presentation

How easy is it to install and start learning?
You can install Spark quickly on your laptop/PC.
• Prerequisite checklist:
  – Java 1.7
  – Scala 2.10.x
  – SPARK/conf

Page 32: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 33: Dec6 meetup spark presentation

Spark Scala REPL

cd $SPARK_HOME
./bin/spark-shell        (web UI on port 4040)

Spark master & worker in the background:
cd $SPARK_HOME
./sbin/start-all.sh      (starts both master and worker)

Page 34: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 35: Dec6 meetup spark presentation

Use case with Spark SQL

• Spark Scala REPL
• Spark SQL

Write some interesting code snippets in the REPL using Scala:
1. Read meetup participants' info and prepare a data file
2. Use Spark SQL to create aggregated data
3. Show a visualization with the Spark output data

Page 36: Dec6 meetup spark presentation

Spark SQL Code: Create Table and Run Queries

1. Create Spark context  // created as sc when we launch the shell
2. Create SQL context
3. Create case class
4. Create RDD
5. Create schema
6. Register the RDD as a table in the schema
7. Run SELECT statements
8. Save SQL output
9. Visualization - D3

Page 37: Dec6 meetup spark presentation

Code

// Spark context is available as sc when the shell launches
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Attendees(Name: String, Interest: String)

val meetup = sc.textFile("/Users/vectorum/Documents/Ramesh/Dec6/meetup.csv")
  .map(_.split(","))
  .map(a => Attendees(a(0), a(1)))

val hyd = sqlContext.createSchemaRDD(meetup)
hyd.registerTempTable("iiit")

val iiitRoster = sqlContext.sql("SELECT Name, Interest FROM iiit")
iiitRoster.count()
iiitRoster.map(a => "Name: " + a(0) + " Interest: " + a(1)).collect().foreach(println)

val iiitAChart = sqlContext.sql("SELECT Interest, count(Interest) FROM iiit GROUP BY Interest ORDER BY Interest")
iiitAChart.map(a => a(0) + "," + a(1)).collect().foreach(println)

Page 38: Dec6 meetup spark presentation

Our Product

Page 39: Dec6 meetup spark presentation

Technology Stack

Visualization: HighCharts, D3
Spark: SQL, Hive, MLlib
Data: HDFS, MySQL, files

Page 40: Dec6 meetup spark presentation

Spark Programming Model
1. Define a set of transformations on input datasets
2. Invoke actions that output the transformed dataset to persistent storage or local memory
3. Run local computations that operate on the results computed in a distributed fashion; these help decide which transformations and actions to undertake next
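This loop can be sketched in the spark-shell (assuming the shell-provided `sc`; the log-file paths and ERROR/OutOfMemory markers are illustrative):

```scala
// 1. Define transformations on the input dataset (lazy)
val lines  = sc.textFile("/filepath/app.log")
val errors = lines.filter(_.contains("ERROR"))

// 2. Invoke an action that brings part of the transformed data to the driver
val sample = errors.take(10)

// 3. Run a local computation on the result to decide the next step
if (sample.exists(_.contains("OutOfMemory")))
  errors.filter(_.contains("OutOfMemory")).saveAsTextFile("/filepath/oom-errors")
```

Step 3 is ordinary local Scala code; its result steers which distributed transformations and actions you run next.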

Page 41: Dec6 meetup spark presentation

Example RDD Lineage

HDFS/file → prepare dataset (RDD-0) → cached RDD → filtered data set 0 … filtered data set n → export data → visualization / machine learning

Page 42: Dec6 meetup spark presentation

Demo - Visualization

• Bubble chart: data distribution
• Heat chart: correlation

Page 43: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 45: Dec6 meetup spark presentation

Final note

Thank you