Introduction to Apache Spark 2.0

Himanshu Gupta, Sr. Software Consultant, Knoldus Software LLP

Agenda

Part 1 (SparkSession)

Part 2 (Structured Streaming)

What is Apache Spark?

● A fast and general engine for large-scale data processing.

● Offers a rich set of APIs and libraries

– In Scala, Java, Python and R

● Most active Apache Big Data project.

Spark Survey 2015

● Reflected the answers and opinions

– Of over 1,417 respondents from 842 organizations.

● Indicated rapid growth of the Spark community.

● Showed a positive attitude towards:

– A concise and unified API for Big Data processing.

● https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html

Apache Spark 2.0

● Released in July 2016

– In fact, version 2.1.0 is already under development.

● Provides a unified API for SQL, Streaming and Graph operations.

SparkSession

What is SparkSession?

● SparkContext: for the Core API

● StreamingContext: for the Streaming API

● SQLContext: for the SQL API

● SparkSession: the unified API

Benefits of Spark 2.0

● Unified DataFrames and Datasets

– DataFrame = Dataset[Row]

● Up to 10x faster than Spark 1.6

– Due to Whole-Stage Code Generation.

● Smarter than Spark Streaming 1.6

– As streaming is structured too.

Img Src: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html

Why is Spark 2.0 faster?

The reason is “Whole-Stage Code Generation”.

Example

Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

Volcano Model

What's wrong here?

Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

Volcano Model

For the answer, let's compare the same code with hand-written code.

System-Generated vs. Hand-Written

Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

Volcano Model vs Hand-Written Code

Volcano

Hand-Written

Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
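The contrast between the two styles can be sketched in plain Python. This is only a toy illustration, not Spark code: the operator classes (`Scan`, `Filter`, `Count`), the two functions and the sample query (count the rows equal to a key) are all hypothetical names chosen for this sketch. The Volcano version pays one `next()` call per operator per row, while the hand-written version is a single loop.

```python
from typing import Iterator, List, Optional


class Scan:
    """Leaf operator: returns one row per next() call, None when exhausted."""

    def __init__(self, rows: List[int]) -> None:
        self._it: Iterator[int] = iter(rows)

    def next(self) -> Optional[int]:
        return next(self._it, None)


class Filter:
    """Pulls rows from its child until one matches the key."""

    def __init__(self, child: Scan, key: int) -> None:
        self.child = child
        self.key = key

    def next(self) -> Optional[int]:
        while True:
            row = self.child.next()
            if row is None or row == self.key:
                return row


class Count:
    """Aggregate operator: drains its child and counts the rows."""

    def __init__(self, child: Filter) -> None:
        self.child = child

    def result(self) -> int:
        count = 0
        while self.child.next() is not None:
            count += 1
        return count


def volcano_count(rows: List[int], key: int) -> int:
    # Each row crosses every operator boundary through a next() call.
    return Count(Filter(Scan(rows), key)).result()


def hand_written_count(rows: List[int], key: int) -> int:
    # The same query as one tight loop with no operator boundaries.
    count = 0
    for row in rows:
        if row == key:
            count += 1
    return count


item_ids = [1000, 7, 1000, 42, 1000]
print(volcano_count(item_ids, 1000), hand_written_count(item_ids, 1000))
```

Both functions return the same answer; the difference is only in how much per-row call overhead sits between the data and the arithmetic.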

Solution

Whole-Stage Code Generation, of course.

It provides the performance of hand-written code with the functionality of a general-purpose engine.

What is Whole-Stage Code Generation?

● Builds on the same query plan as the Volcano Model

– But compiles it instead of interpreting it operator by operator.

● The key difference is

– Earlier, Spark applied code generation only to expression evaluation (e.g., “1 + a”), but now it generates code for the entire query.
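As a rough analogy only, the idea of generating code for the entire query can be mimicked in a few lines of Python: build the source of one function that scans, filters and aggregates, then compile it. The helper names (`generate_query_source`, `compile_query`) and the sample query are hypothetical, and this says nothing about how Spark's own generator is implemented.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


def generate_query_source(filter_expr: str, sum_expr: str) -> str:
    """Emit source for ONE function that scans, filters and aggregates."""
    return (
        "def run_query(rows):\n"
        "    total = 0\n"
        "    for row in rows:\n"
        f"        if {filter_expr}:\n"
        f"            total += {sum_expr}\n"
        "    return total\n"
    )


def compile_query(filter_expr: str, sum_expr: str) -> Callable[[List[Row]], int]:
    source = generate_query_source(filter_expr, sum_expr)
    namespace: dict = {}
    # Compile the generated source; only safe for trusted expressions.
    exec(source, namespace)
    return namespace["run_query"]


# Hypothetical query: SELECT sum(1 + a) FROM rows WHERE a > 1
query = compile_query("row['a'] > 1", "1 + row['a']")
rows = [{"a": 1}, {"a": 2}, {"a": 3}]
print(query(rows))
```

The expression “1 + a” ends up inlined in the loop body, so the filter, the expression and the aggregate all run inside a single generated function.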

Spark 1.x vs Spark 2.0

Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

Demo 1

Agenda

Part 1 (SparkSession)

Part 2 (Structured Streaming)

Questions?

Streaming Applications

Pros: ● Consistent ● In-Order Data ● No Shuffling

Cons: ● Not Scalable ● No Fault Tolerance

Pros: ● Scalable ● Fault Tolerant

Cons: ● Inconsistent ● Out-of-Order Data ● Too Much Shuffling

Continuous Application

Img Src: https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html

How do we achieve it?

Solution

Structured Streaming

Structured Streaming guarantees that at any time, the output of the application is equivalent to executing a batch job on a prefix of the data.

How?

Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

Conceptually, Structured Streaming treats all the data arriving as an infinite input table.

● The developer defines a query on the input table

– As if it were a static table.

● Results are computed in a Result Table

– Which is then written to an output sink.

● Finally, the developer defines triggers

– To control when the result is updated.

Incremental Execution

Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
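The model above can be sketched without Spark. In this toy Python example (all names are hypothetical, and a word count stands in for the query), the batch function is the query written as if over a static table, and the incremental class keeps the result table as state and folds in only the rows that arrived since the last trigger. The assertion inside the loop checks the stated guarantee: after every trigger the output equals a batch job on the prefix of data seen so far.

```python
from collections import Counter
from typing import Dict, List


def batch_word_count(lines: List[str]) -> Dict[str, int]:
    """The 'static table' query: count words over a complete input."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


class IncrementalWordCount:
    """Keeps the result table as state and folds in only the new rows."""

    def __init__(self) -> None:
        self.result_table: Counter = Counter()

    def on_trigger(self, new_lines: List[str]) -> Dict[str, int]:
        for line in new_lines:
            self.result_table.update(line.split())
        return dict(self.result_table)


# Rows appended to the (conceptually unbounded) input table, per trigger.
arrivals = [["cat dog"], ["dog dog"], ["owl cat"]]

query = IncrementalWordCount()
seen: List[str] = []
for new_lines in arrivals:
    seen.extend(new_lines)
    result = query.on_trigger(new_lines)
    # Output after each trigger equals a batch job over the prefix seen so far.
    assert result == batch_word_count(seen)

print(result)
```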

Output Modes

● Append

– Only the new rows appended to the result table since the last trigger will be written to the external storage.

● Complete

– The entire updated result table will be written to external storage.

● Update

– Only the rows that were updated in the result table since the last trigger will be changed in the external storage.
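A minimal sketch of the three modes, in plain Python rather than Spark: given the result table before and after a trigger, the hypothetical helper `rows_to_write` picks the rows each mode would send to the sink. It assumes that "update" covers both changed and newly added rows, and it lets "append" run on a table whose existing rows also change, which is a simplification; Spark itself restricts which queries may use append mode.

```python
from typing import Dict, List, Tuple

Table = Dict[str, int]  # result table keyed by a row key


def rows_to_write(mode: str, previous: Table, current: Table) -> List[Tuple[str, int]]:
    """Return the result-table rows a trigger would send to the sink."""
    if mode == "complete":
        selected = list(current.items())
    elif mode == "append":
        selected = [(k, v) for k, v in current.items() if k not in previous]
    elif mode == "update":
        selected = [(k, v) for k, v in current.items() if previous.get(k) != v]
    else:
        raise ValueError(f"unknown output mode: {mode}")
    return sorted(selected)


previous = {"cat": 1, "dog": 1}
current = {"cat": 1, "dog": 3, "owl": 1}

for mode in ("append", "complete", "update"):
    print(mode, rows_to_write(mode, previous, current))
```

Here "append" emits only the new `owl` row, "update" emits `dog` and `owl`, and "complete" emits the whole table.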

Other Benefits

● Easy to use

– As it is simply Spark’s DataFrame/Dataset API.

● Uses Spark’s existing DataFrame/Dataset API

– So we can map, filter and aggregate data as we do in Spark SQL.

● Joins streams with static data

– A stream can be joined with a static DataFrame.

There are many more...

Requirements

● Input sources must be replayable

– So that recent data can be re-read if the job crashes.

● Output sinks must support transactional updates

– So that the system can make a set of records appear atomically.
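Why these two requirements matter together can be shown with a toy Python sketch (not Spark code; `ReplayableSource`, `TransactionalSink` and `run` are hypothetical names). The source keeps records addressable by offset so a range can be re-read, and the sink commits a whole batch at once under a batch id, so replaying a batch after a crash does not duplicate output. Restarting from the first batch is a simplification made here; a real system would record its progress and resume from there.

```python
from typing import Dict, List


class ReplayableSource:
    """Records stay addressable by offset, so any range can be re-read."""

    def __init__(self, records: List[str]) -> None:
        self._records = records

    def read(self, start: int, end: int) -> List[str]:
        return self._records[start:end]


class TransactionalSink:
    """Commits each batch atomically, keyed by batch id (idempotent)."""

    def __init__(self) -> None:
        self.committed: Dict[int, List[str]] = {}

    def commit(self, batch_id: int, records: List[str]) -> None:
        if batch_id not in self.committed:  # a replayed batch is a no-op
            self.committed[batch_id] = list(records)

    def output(self) -> List[str]:
        return [r for b in sorted(self.committed) for r in self.committed[b]]


def run(source: ReplayableSource, sink: TransactionalSink,
        offsets: List[int], crash_after: int = -1) -> None:
    """Process batches [offsets[i], offsets[i+1]); optionally 'crash' midway."""
    for batch_id in range(len(offsets) - 1):
        records = source.read(offsets[batch_id], offsets[batch_id + 1])
        sink.commit(batch_id, records)
        if batch_id == crash_after:
            raise RuntimeError("simulated crash")


source = ReplayableSource(["a", "b", "c", "d"])
sink = TransactionalSink()
offsets = [0, 2, 4]

try:
    run(source, sink, offsets, crash_after=0)  # dies after batch 0 committed
except RuntimeError:
    pass
run(source, sink, offsets)  # restart: replays batch 0, then runs batch 1

print(sink.output())
```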

Comparison with Other Engines

Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

Demo 2

Questions?

Code

https://github.com/knoldus/Sparkathon

References

● https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html

● https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

● https://www.youtube.com/watch?v=ZFBgY0PwUeY

● http://spark.apache.org/docs/latest/

Thank You!