Spark - Lightning-Fast Cluster Computing by Example Ramesh Mudunuri, Vectorum Saturday, December 6, 2014




Page 1: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example

Ramesh Mudunuri, Vectorum
Saturday, December 6, 2014

Page 2: Dec6 meetup spark presentation

About me
• Big data enthusiast
• Member of the product development team at Vectorum.com, a startup using Spark technology

Page 3: Dec6 meetup spark presentation

What to expect
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

This is not…
• A training class
• A workshop
• A product demo with commercial interest

Page 4: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 5: Dec6 meetup spark presentation

What is Spark?

General purpose large-scale high performance processing engine

http://spark.apache.org/

Page 6: Dec6 meetup spark presentation

What is Spark?

Like MapReduce, but an in-memory processing engine, and much faster

http://spark.apache.org/

Page 7: Dec6 meetup spark presentation

What is Spark?

• Apache Spark™ is a fast and general engine for large-scale data processing.

Page 8: Dec6 meetup spark presentation

Spark History

• Started as a research project in 2009 at the UC Berkeley AMPLab; open-sourced in 2010 and later became an Apache project

• Matei Zaharia, Spark development team member and Databricks co-founder

Page 9: Dec6 meetup spark presentation

Why is Spark so special?
• Speed: a general-purpose, fast, in-memory processing engine
• (Relatively) easy to develop and deploy complex analytical applications
• APIs for Java, Scala, and Python
• Well-integrated ecosystem tools

www.databricks.com

Page 10: Dec6 meetup spark presentation

Why is Spark so special…
• In-memory processing makes Spark well suited for iterative algorithm computations
• Can run in various setups:
  – Standalone (my favorite way to learn Spark)
  – Cluster, EC2
  – YARN, Mesos
• Reads data from:
  – Local file system
  – HDFS
  – HBase, Cassandra, and more

http://www.cloudera.com

Page 11: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 12: Dec6 meetup spark presentation

Apache Spark Core
• Foundation of the stack
• Scheduling
• Memory management
• Fault recovery, etc.

Page 13: Dec6 meetup spark presentation

Spark SQL
• Execute Spark jobs with SQL expressions
• Compatible with Hive*
• JDBC/ODBC connection capabilities

* Hive: distributed data-storage SQL software with custom UDF capabilities

Page 14: Dec6 meetup spark presentation

Spark Streaming
• Component to process live streams of data
• API to handle streaming data
• Example sources: log files, queued messages, sensor-emitted data

Page 15: Dec6 meetup spark presentation

MLlib - Machine Learning
• Libraries of machine-learning algorithms
• E.g.: classification, regression, clustering, collaborative filtering, dimensionality reduction
• Very active Spark development community

Page 16: Dec6 meetup spark presentation

GraphX

APIs for graph computation:
• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count

Alpha level

Page 17: Dec6 meetup spark presentation

Spark Engine Terminology
• Spark Context
  – An object Spark uses to access the cluster
• Driver & Executor
  – The driver runs the main program and launches parallel operations
  – Executors run inside workers and execute the tasks
• Resilient Distributed Dataset (RDD)
  – Immutable, fault-tolerant collection object
  – RDD functions (similar to Hadoop MapReduce functions):
    1. Transformations
    2. Actions

Page 18: Dec6 meetup spark presentation

Spark shell and Spark context

Page 19: Dec6 meetup spark presentation

Driver & Executor
The driver runs the main program and launches parallel operations.
Executors run inside workers and execute the tasks.

Page 20: Dec6 meetup spark presentation

RDD-Resilient Distributed Dataset

• Resilient Distributed Datasets (RDDs) are Spark's fundamental abstraction for representing a collection of objects that can be distributed across multiple machines in a cluster.

• Simple definition: an immutable, fault-tolerant collection object

• There are two ways to create an RDD in Spark:
  1. Create an RDD from an external data source
  2. Perform a transformation on one or more existing RDDs

  val lines = sc.textFile("/filepath/README.md")
  val errors = lines.filter(_.startsWith("ERROR"))

Page 21: Dec6 meetup spark presentation

RDD

• There are two ways to create an RDD in Spark:

1. Create an RDD from an external data source:
   val lines = sc.textFile("/filepath/README.md")

2. Perform a transformation on one or more existing RDDs:
   val errors = lines.filter(_.startsWith("ERROR"))

Page 22: Dec6 meetup spark presentation

Transformation - Action

• Transformation operations are lazy (they are not executed immediately)
• Transformations create new RDDs from existing RDDs, e.g. filter, map
• Action operations return final values to the driver program or write data to the file system, e.g. collect, saveAsTextFile

http://www.mapr.com
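A minimal spark-shell sketch of this lazy behavior, reusing the README path from the RDD slides (the `sc` context is provided by the shell; the output path is illustrative):

```scala
// Transformations are lazy: these lines only record the lineage, no job runs yet
val lines  = sc.textFile("/filepath/README.md")
val errors = lines.filter(_.startsWith("ERROR"))
val counts = errors.map(_.length)             // still nothing has executed

// Actions trigger the actual computation
val n = errors.count()                        // returns a value to the driver
errors.saveAsTextFile("/filepath/errors-out") // writes data to the file system
```

Until `count()` or `saveAsTextFile()` is called, Spark only builds the execution plan; this is what lets it bundle transformations together and optimize the whole pipeline.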

Page 23: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 24: Dec6 meetup spark presentation

How is Spark different from Hadoop MapReduce?

1. Speed
   • Spark: 100x faster in-memory, 10x faster on disk

2. Ease of use
   • Spark: easily write applications in Java, Scala, or Python; interactive shell available for Scala and Python; high-level, simple map-reduce operations
   • Hadoop: Java only; no shell; complex map-reduce operations

3. Tools
   • Spark: well-integrated tools (Spark SQL, Streaming, MLlib, etc.) for developing complex analytical applications
   • Hadoop: loosely coupled large set of tools, but very mature

4. Deployment
   • Spark: Hadoop V1/V2 (YARN), and also Mesos and Amazon EC2

5. Data sources
   • Spark: HDFS (Hadoop), HBase, Cassandra, Amazon S3

Page 25: Dec6 meetup spark presentation

How is Spark different from Hadoop MapReduce? (continued)

6. Applications
   • Spark: an 'application' is the higher-level unit; it runs multiple jobs in sequence or in parallel, and its processes, called executors, run on the cluster workers
   • Hadoop: a 'job' is the higher-level unit; it processes data with map-reduce and writes the result to storage

7. Executors
   • Spark: an executor can run multiple tasks in a single process
   • Hadoop: each map/reduce task runs in its own process

8. Shared variables
   • Spark: broadcast variables are read-only (lookup) variables shipped only once to each worker; accumulators let workers add values while only the driver reads the result, and they are fault tolerant
   • Hadoop: counters, plus additional (system) metric counters such as 'Map input records'

9. Persisting/caching RDDs
   • Spark: cached RDDs can be reused across operations, which increases processing speed

10. Lazy evaluation
   • Spark: transformation execution plans are bundled together and execute only when an RDD action is invoked
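Rows 8 and 9 can be sketched in the spark-shell (assuming the shell-provided `sc`; the lookup table and sample data are made-up illustrations):

```scala
// Broadcast variable: a read-only lookup table shipped once to each worker
val codes = sc.broadcast(Map("NY" -> "New York", "CA" -> "California"))

// Accumulator: workers add to it, only the driver reads the final value
val badRecords = sc.accumulator(0)

val states = sc.parallelize(Seq("NY", "CA", "TX"))
val names = states.map { s =>
  if (!codes.value.contains(s)) badRecords += 1   // count lookup misses on the workers
  codes.value.getOrElse(s, "unknown")
}

names.cache()             // keep the RDD in memory for reuse across actions
names.count()             // first action computes and caches the RDD
names.collect()           // second action reuses the cached data
println(badRecords.value) // driver reads the accumulator result
```

Without the broadcast, the lookup map would be re-shipped with every task; without `cache()`, the second action would recompute the whole lineage from scratch.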

Page 26: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 27: Dec6 meetup spark presentation

Where Spark shines
• Well suited for any iterative computation
  – Machine-learning algorithms
  – Iterative analytics
• Multi-data-source computations
  – Multi-sourced sensor data
• Aggregated analytics
  – Transforming and summarizing the data

Page 28: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 29: Dec6 meetup spark presentation

• Link: http://spark.apache.org/downloads.html
• Standalone - choose a package type: pre-built for Hadoop 1.x
• Source code is also available; build tools: Maven or sbt
• Distro versions: Hadoop, Cloudera, MapR

Page 30: Dec6 meetup spark presentation

Current Spark version

Release cycle: every 3 months

Page 31: Dec6 meetup spark presentation

How easy is it to install and start learning?
You can install Spark quickly on your laptop/PC.
• Prerequisite checklist:
  – Java 1.7
  – Scala 2.10.x
  – SPARK/conf

Page 32: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 33: Dec6 meetup spark presentation

Spark Scala REPL

cd $SPARK_HOME
./bin/spark-shell        (web UI on port 4040)

Spark master & worker in the background:
cd $SPARK_HOME
./sbin/start-all.sh      (starts both master and worker)

Page 34: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 35: Dec6 meetup spark presentation

Use case with Spark SQL

• Spark Scala REPL
• Spark SQL

Write some interesting code snippets in the REPL using Scala:
1. Read meetup participants' info and prepare a data file
2. Use Spark SQL to create aggregated data
3. Show a visualization with the Spark output data

Page 36: Dec6 meetup spark presentation

Spark SQL Code: Create Table and Run Queries

1. Create Spark context  // created as sc when we launch the shell
2. Create SQL context
3. Create case class
4. Create RDD
5. Create schema
6. Register the RDD as a table in the schema
7. Run SELECT statements
8. Save SQL output
9. Visualization - D3

Page 37: Dec6 meetup spark presentation

Code

// Spark context is available as sc when the shell launches
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Attendees(Name: String, Interest: String)

val meetup = sc.textFile("/Users/vectorum/Documents/Ramesh/Dec6/meetup.csv")
  .map(_.split(","))
  .map(a => Attendees(a(0), a(1)))

val hyd = sqlContext.createSchemaRDD(meetup)
hyd.registerTempTable("iiit")

val iiitRoster = sqlContext.sql("SELECT Name, Interest FROM iiit")
iiitRoster.count()
iiitRoster.map(a => "Name: " + a(0) + " Interest: " + a(1)).collect().foreach(println)

val iiitAChart = sqlContext.sql("SELECT Interest, count(Interest) FROM iiit GROUP BY Interest ORDER BY Interest")
iiitAChart.map(a => a(0) + "," + a(1)).collect().foreach(println)

Page 38: Dec6 meetup spark presentation

Our Product

Page 39: Dec6 meetup spark presentation

Technology Stack

Visualization: HighCharts, D3
Spark: SQL, Hive, MLlib
Data: HDFS, MySQL, files

Page 40: Dec6 meetup spark presentation

Spark Programming Model
1. Define a set of transformations on input datasets
2. Invoke actions that output the transformed dataset to persistent storage or local memory
3. Run local computations that operate on the results computed in a distributed fashion; these help decide which transformations and actions to undertake next
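This loop can be sketched in the spark-shell (assuming the shell-provided `sc`; the log-file paths and ERROR/OutOfMemory markers are illustrative):

```scala
// 1. Define transformations on the input dataset (lazy)
val lines  = sc.textFile("/filepath/app.log")
val errors = lines.filter(_.contains("ERROR"))

// 2. Invoke an action that brings part of the transformed data to the driver
val sample = errors.take(10)

// 3. Run a local computation on the result to decide the next step
if (sample.exists(_.contains("OutOfMemory")))
  errors.filter(_.contains("OutOfMemory")).saveAsTextFile("/filepath/oom-errors")
```

Step 3 is ordinary local Scala code; its result steers which distributed transformations and actions you run next.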

Page 41: Dec6 meetup spark presentation

Example RDD Lineage

HDFS/file → prepare dataset (RDD-0) → cached RDD → filtered data set 0 … filtered data set n → export data → visualization / machine learning

Page 42: Dec6 meetup spark presentation

Demo - Visualization

• Bubble chart: data distribution
• Heat chart: correlation

Page 43: Dec6 meetup spark presentation

Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines
• How easy it is to install and start learning
• Small code demos
• Where to find additional information

Page 45: Dec6 meetup spark presentation

Final note

Thank you