Introduction to Apache Spark 2.0
Himanshu Gupta
Sr. Software Consultant
Knoldus Software LLP
Agenda
Part 1(SparkSession)
Part 2(Structured Streaming)
What is Apache Spark?
● A fast and general engine for large-scale data processing.
● Offers a rich set of APIs and libraries
– In Scala, Java, Python and R.
● The most active Apache Big Data project.
Spark Survey 2015
● Reflected the answers and opinions of over 1,417 respondents from 842 organizations.
● Indicated rapid growth of the Spark community.
● Showed a positive attitude towards a concise and unified API for Big Data processing.
● https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html
Apache Spark 2.0
● Released in July this year
– In fact, version 2.1.0 is already under development.
● Provides a unified API for SQL, Streaming and Graph operations.
SparkSession
What is SparkSession?

SparkContext – for the Core API
StreamingContext – for the Streaming API
SQLContext – for the SQL API

SparkSession – a unified API over all of them
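The idea of one entry point wrapping the older contexts can be sketched as a facade object. This is a toy model for illustration only, not Spark's implementation — in real Spark 2.0 the entry point is built with `SparkSession.builder.getOrCreate()`, and the wrapped `SparkContext` is reachable as `spark.sparkContext`:

```python
# Toy sketch: a single "session" object unifying per-purpose contexts
# behind one entry point, the way SparkSession wraps SparkContext and
# SQLContext. NOT Spark's actual implementation.

class CoreContext:
    def parallelize(self, data):
        return list(data)            # stand-in for creating an RDD

class SQLCtx:
    def sql(self, query):
        return f"executed: {query}"  # stand-in for returning a DataFrame

class Session:
    """One entry point exposing all underlying contexts."""
    def __init__(self):
        self.sparkContext = CoreContext()
        self._sql = SQLCtx()

    def sql(self, query):
        return self._sql.sql(query)

# One object, all APIs:
spark = Session()
print(spark.sparkContext.parallelize(range(3)))  # [0, 1, 2]
print(spark.sql("SELECT 1"))                     # executed: SELECT 1
```

The old contexts remain accessible through the session, so existing code keeps working while new code uses the single entry point.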
Benefits of Spark 2.0
● Unified DataFrames and Datasets
– DataFrame = Dataset[Row]
● 10X faster than Spark 1.6
– Due to Whole-Stage Code Generation.
● Smarter than Spark Streaming 1.6
– As streaming is structured too.
Img Src: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html
Why is Spark 2.0 Faster?

The reason is
“Whole-Stage Code Generation”
Example
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Volcano Model
What’s wrong here?
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Volcano Model
For the answer, let’s compare the same code with hand-written code.
System-Generated vs Hand-Written
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Volcano Model vs Hand-Written Code
Volcano
Hand-Written
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Solution
Of Course
Whole-Stage Code Generation
Provides the performance of hand-written code with the functionality of a general-purpose engine.
What is Whole-Stage Code Generation?
● Same as the Volcano Model
– It generates code using the same process.
● The only difference:
– Earlier, Spark applied code generation only to expression evaluation (e.g., “1 + a”); now it generates code for the entire query.
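The difference can be illustrated outside Spark with a toy query, `sum(x) where x > 5`. This is a conceptual sketch of the two execution styles, not Spark's actual generated code:

```python
# Toy contrast: Volcano-style interpretation vs. a fused,
# whole-stage-generated loop for: SELECT sum(x) WHERE x > 5.
# Conceptual sketch only, not Spark's generated code.

data = list(range(10))

# Volcano model: each operator is a generic iterator; every row is
# pulled through a chain of virtual next() calls.
def scan(rows):
    for r in rows:
        yield r

def filter_op(child, predicate):
    for r in child:
        if predicate(r):
            yield r

def volcano_sum(rows):
    total = 0
    for r in filter_op(scan(rows), lambda x: x > 5):
        total += r
    return total

# Whole-stage style: the operators are fused into one tight loop,
# with no per-row iterator or function-call overhead.
def fused_sum(rows):
    total = 0
    for x in rows:
        if x > 5:
            total += x
    return total

print(volcano_sum(data), fused_sum(data))  # 30 30
```

Both produce the same answer; the fused loop simply removes the per-row interpretation overhead, which is where the speed-up comes from.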
Spark 1.x vs Spark 2.0
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Demo 1
Questions ??
Streaming Applications
Approach A
Pros: ● Consistent ● In-Order Data ● No Shuffling
Cons: ● Non-Scalable ● No Fault Tolerance

Approach B
Pros: ● Scalable ● Fault Tolerant
Cons: ● Inconsistent ● Out-of-Order Data ● Too much Shuffling
Continuous Application
Img Src: https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
How to Achieve It?
Solution
Structured Streaming
Structured Streaming guarantees that at any time, the output of the application is equivalent to executing a batch job on a prefix of the data.
How ?
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Conceptually, Structured Streaming treats all the data arriving as an infinite input table.
How?
● The developer defines a query on the input table
– As if it were a static table.
● Results are computed into a Result Table
– Which is then written to an output sink.
● Finally, the developer defines triggers
– To control when the result is updated.
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Incremental Execution
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
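The prefix guarantee above can be checked with a toy running word count: after every trigger, the incrementally maintained result must equal a batch query over all data seen so far. This is a simulation of the model, not the real engine:

```python
# Toy simulation of the Structured Streaming model: an unbounded
# input table, an incremental aggregation, and the guarantee that
# after each trigger the result equals a batch job over the prefix
# of data received so far. (Illustration only, not Spark code.)

from collections import Counter

def batch_word_count(rows):
    # What a batch job over the whole prefix would produce.
    return Counter(word for line in rows for word in line.split())

input_table = []           # the "infinite" input table
result_table = Counter()   # incrementally maintained result

stream = ["cat dog", "dog", "cat cat"]
for new_line in stream:    # each element arrives at one trigger
    input_table.append(new_line)
    result_table.update(new_line.split())   # incremental update
    # Guarantee: incremental result == batch over the prefix.
    assert result_table == batch_word_count(input_table)

print(dict(result_table))  # {'cat': 3, 'dog': 2}
```

Because the two always agree, a streaming query can be reasoned about as if it were the equivalent batch query.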
Output Modes
● Append
– Only the new rows appended to the result table since the last trigger are written to the external storage.
● Complete
– The entire updated result table is written to the external storage.
● Update
– Only the rows updated in the result table since the last trigger are changed in the external storage.
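The three modes can be illustrated by diffing the result table between two triggers. A toy sketch (ignoring, e.g., that real Append mode requires rows that never change):

```python
# Toy illustration of the three output modes, comparing the result
# table before and after one trigger. Conceptual sketch only.

def append_rows(before, after):
    # Append: only rows (keys) that are entirely new since the last trigger.
    return {k: v for k, v in after.items() if k not in before}

def update_rows(before, after):
    # Update: rows that are new OR whose value changed.
    return {k: v for k, v in after.items() if before.get(k) != v}

def complete_rows(before, after):
    # Complete: the entire result table, every trigger.
    return dict(after)

before = {"cat": 1, "dog": 2}            # result table at trigger N
after = {"cat": 1, "dog": 3, "owl": 1}   # result table at trigger N+1

print(append_rows(before, after))    # {'owl': 1}
print(update_rows(before, after))    # {'dog': 3, 'owl': 1}
print(complete_rows(before, after))  # {'cat': 1, 'dog': 3, 'owl': 1}
```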
Other Benefits
● Easy to use
– It is simply Spark’s DataFrame/Dataset API.
● Uses Spark’s existing DataFrame/Dataset API
– So we can map, filter and aggregate data as we do in Spark SQL.
● Join streams with static data
– A stream can be joined with a static DataFrame.
There are many more...
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
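The stream-static join can be sketched as enriching each streamed record from a fixed lookup table. This is a toy, dictionary-based illustration of the idea, not Spark's join implementation:

```python
# Toy sketch of joining a stream with static data: each incoming
# record in a micro-batch is enriched from a fixed lookup table,
# the way a streaming DataFrame can be joined with a static
# DataFrame. Illustration only.

static_customers = {  # static DataFrame stand-in
    1: "Alice",
    2: "Bob",
}

def join_with_static(stream_batch, static):
    # Inner join on customer_id: records with no match are dropped.
    return [
        {**event, "name": static[event["customer_id"]]}
        for event in stream_batch
        if event["customer_id"] in static
    ]

batch = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 3, "amount": 99},  # no static match -> dropped
    {"customer_id": 2, "amount": 20},
]
print(join_with_static(batch, static_customers))
# [{'customer_id': 1, 'amount': 10, 'name': 'Alice'},
#  {'customer_id': 2, 'amount': 20, 'name': 'Bob'}]
```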
Requirements
● Input sources must be replayable
– So that recent data can be re-read if the job crashes.
● Output sinks must support transactional updates
– So that the system can make a set of records appear atomically.
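Together, these two requirements make exactly-once output possible after a crash. A toy sketch of the mechanism (a replayable log plus a sink that records data and offset together; not Spark's actual recovery code):

```python
# Toy sketch of the two requirements: a replayable source (re-read
# from a recorded offset) and a transactional sink (a batch of rows
# plus its offset become visible atomically), which together give
# exactly-once output after a crash. Conceptual, not Spark code.

source = ["a", "b", "c", "d"]  # replayable: an indexable log

class TransactionalSink:
    def __init__(self):
        self.rows = []
        self.committed_offset = 0

    def commit(self, batch, new_offset):
        # Rows and offset are recorded together ("atomically").
        self.rows.extend(batch)
        self.committed_offset = new_offset

def run_trigger(sink, crash_before_commit=False):
    # Always resume from the last committed offset.
    offset = sink.committed_offset
    batch = source[offset:offset + 2]
    if crash_before_commit:
        return  # crash: nothing was committed, so no effect
    sink.commit(batch, offset + len(batch))

sink = TransactionalSink()
run_trigger(sink)                           # writes ["a", "b"]
run_trigger(sink, crash_before_commit=True) # crash: nothing written
run_trigger(sink)                           # replays ["c", "d"] exactly once
print(sink.rows)  # ['a', 'b', 'c', 'd']
```

After the simulated crash, the batch is simply re-read from the source and written once, with no duplicates and no gaps.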
Comparison with Other Engines
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Demo 2
Questions ??
References
● https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html
● https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
● https://www.youtube.com/watch?v=ZFBgY0PwUeY
● http://spark.apache.org/docs/latest/
Thank You !!!