Introduction to Apache Spark 2.0
Himanshu Gupta
Sr. Software Consultant
Knoldus Software LLP
Agenda
Part 1(SparkSession)
Part 2(Structured Streaming)
What is Apache Spark?
● A fast and general engine for large-scale data processing.
● Offers a rich set of APIs and libraries
– In Scala, Java, Python and R.
● The most active Apache Big Data project.
Spark Survey 2015
● Reflected the answers and opinions of over 1,417 respondents from 842 organizations.
● Indicated rapid growth of the Spark community.
● Showed a positive attitude towards a concise and unified API for Big Data processing.
● https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html
Apache Spark 2.0
● Released in July this year
– In fact, version 2.1.0 is already under development.
● Provides a unified API for SQL, Streaming and Graph operations.
SparkSession
What is SparkSession?

SparkContext – for the Core API
StreamingContext – for the Streaming API
SQLContext – for the SQL API

SparkSession – a unified API over all of them
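The idea of one entry point wrapping the older contexts can be sketched as a facade object. This is a toy model for illustration only, not Spark's implementation — in real Spark 2.0 the entry point is built with `SparkSession.builder.getOrCreate()`, and the wrapped `SparkContext` is reachable as `spark.sparkContext`:

```python
# Toy sketch: a single "session" object unifying per-purpose contexts
# behind one entry point, the way SparkSession wraps SparkContext and
# SQLContext. NOT Spark's actual implementation.

class CoreContext:
    def parallelize(self, data):
        return list(data)            # stand-in for creating an RDD

class SQLCtx:
    def sql(self, query):
        return f"executed: {query}"  # stand-in for returning a DataFrame

class Session:
    """One entry point exposing all underlying contexts."""
    def __init__(self):
        self.sparkContext = CoreContext()
        self._sql = SQLCtx()

    def sql(self, query):
        return self._sql.sql(query)

# One object, all APIs:
spark = Session()
print(spark.sparkContext.parallelize(range(3)))  # [0, 1, 2]
print(spark.sql("SELECT 1"))                     # executed: SELECT 1
```

The old contexts remain accessible through the session, so existing code keeps working while new code uses the single entry point.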
Benefits of Spark 2.0
● Unified DataFrames and Datasets
– DataFrame = Dataset[Row]
● 10X faster than Spark 1.6
– Due to Whole-Stage Code Generation.
● Smarter than Spark Streaming 1.6
– As streaming is structured too.
Img Src: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html
Why is Spark 2.0 Faster?

The reason is
“Whole-Stage Code Generation”
Example
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Volcano Model
What’s wrong here?
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Volcano Model
For the answer, let’s compare the same code with hand-written code.
System-Generated vs Hand-Written
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Volcano Model vs Hand-Written Code
Volcano
Hand-Written
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Solution
Of Course
Whole-Stage Code Generation
Provides the performance of hand-written code with the functionality of a general-purpose engine.
What is Whole-Stage Code Generation?
● Same as the Volcano Model
– It generates code using the same process.
● The only difference:
– Earlier, Spark applied code generation only to expression evaluation (e.g., “1 + a”); now it generates code for the entire query.
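The difference can be illustrated outside Spark with a toy query, `sum(x) where x > 5`. This is a conceptual sketch of the two execution styles, not Spark's actual generated code:

```python
# Toy contrast: Volcano-style interpretation vs. a fused,
# whole-stage-generated loop for: SELECT sum(x) WHERE x > 5.
# Conceptual sketch only, not Spark's generated code.

data = list(range(10))

# Volcano model: each operator is a generic iterator; every row is
# pulled through a chain of virtual next() calls.
def scan(rows):
    for r in rows:
        yield r

def filter_op(child, predicate):
    for r in child:
        if predicate(r):
            yield r

def volcano_sum(rows):
    total = 0
    for r in filter_op(scan(rows), lambda x: x > 5):
        total += r
    return total

# Whole-stage style: the operators are fused into one tight loop,
# with no per-row iterator or function-call overhead.
def fused_sum(rows):
    total = 0
    for x in rows:
        if x > 5:
            total += x
    return total

print(volcano_sum(data), fused_sum(data))  # 30 30
```

Both produce the same answer; the fused loop simply removes the per-row interpretation overhead, which is where the speed-up comes from.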
Spark 1.x vs Spark 2.0
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Demo 1
Questions ??
Streaming Applications
Approach A
Pros: ● Consistent ● In-Order Data ● No Shuffling
Cons: ● Non-Scalable ● No Fault Tolerance

Approach B
Pros: ● Scalable ● Fault Tolerant
Cons: ● Inconsistent ● Out-of-Order Data ● Too much Shuffling
Continuous Application
Img Src: https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
How to Achieve It?
Solution
Structured Streaming
Structured Streaming guarantees that at any time, the output of the application is equivalent to executing a batch job on a prefix of the data.
How ?
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Conceptually, Structured Streaming treats all the data arriving as an infinite input table.
How?
● The developer defines a query on the input table
– As if it were a static table.
● Results are computed into a Result Table
– Which is then written to an output sink.
● Finally, the developer defines triggers
– To control when the result is updated.
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Incremental Execution
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
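The prefix guarantee above can be checked with a toy running word count: after every trigger, the incrementally maintained result must equal a batch query over all data seen so far. This is a simulation of the model, not the real engine:

```python
# Toy simulation of the Structured Streaming model: an unbounded
# input table, an incremental aggregation, and the guarantee that
# after each trigger the result equals a batch job over the prefix
# of data received so far. (Illustration only, not Spark code.)

from collections import Counter

def batch_word_count(rows):
    # What a batch job over the whole prefix would produce.
    return Counter(word for line in rows for word in line.split())

input_table = []           # the "infinite" input table
result_table = Counter()   # incrementally maintained result

stream = ["cat dog", "dog", "cat cat"]
for new_line in stream:    # each element arrives at one trigger
    input_table.append(new_line)
    result_table.update(new_line.split())   # incremental update
    # Guarantee: incremental result == batch over the prefix.
    assert result_table == batch_word_count(input_table)

print(dict(result_table))  # {'cat': 3, 'dog': 2}
```

Because the two always agree, a streaming query can be reasoned about as if it were the equivalent batch query.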
Output Modes
● Append
– Only the new rows appended to the result table since the last trigger are written to the external storage.
● Complete
– The entire updated result table is written to the external storage.
● Update
– Only the rows updated in the result table since the last trigger are changed in the external storage.
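The three modes can be illustrated by diffing the result table between two triggers. A toy sketch (ignoring, e.g., that real Append mode requires rows that never change):

```python
# Toy illustration of the three output modes, comparing the result
# table before and after one trigger. Conceptual sketch only.

def append_rows(before, after):
    # Append: only rows (keys) that are entirely new since the last trigger.
    return {k: v for k, v in after.items() if k not in before}

def update_rows(before, after):
    # Update: rows that are new OR whose value changed.
    return {k: v for k, v in after.items() if before.get(k) != v}

def complete_rows(before, after):
    # Complete: the entire result table, every trigger.
    return dict(after)

before = {"cat": 1, "dog": 2}            # result table at trigger N
after = {"cat": 1, "dog": 3, "owl": 1}   # result table at trigger N+1

print(append_rows(before, after))    # {'owl': 1}
print(update_rows(before, after))    # {'dog': 3, 'owl': 1}
print(complete_rows(before, after))  # {'cat': 1, 'dog': 3, 'owl': 1}
```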
Other Benefits
● Easy to use
– It is simply Spark’s DataFrame/Dataset API.
● Uses Spark’s existing DataFrame/Dataset API
– So we can map, filter and aggregate data as we do in Spark SQL.
● Join streams with static data
– A stream can be joined with a static DataFrame.
There are many more...
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
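The stream-static join can be sketched as enriching each streamed record from a fixed lookup table. This is a toy, dictionary-based illustration of the idea, not Spark's join implementation:

```python
# Toy sketch of joining a stream with static data: each incoming
# record in a micro-batch is enriched from a fixed lookup table,
# the way a streaming DataFrame can be joined with a static
# DataFrame. Illustration only.

static_customers = {  # static DataFrame stand-in
    1: "Alice",
    2: "Bob",
}

def join_with_static(stream_batch, static):
    # Inner join on customer_id: records with no match are dropped.
    return [
        {**event, "name": static[event["customer_id"]]}
        for event in stream_batch
        if event["customer_id"] in static
    ]

batch = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 3, "amount": 99},  # no static match -> dropped
    {"customer_id": 2, "amount": 20},
]
print(join_with_static(batch, static_customers))
# [{'customer_id': 1, 'amount': 10, 'name': 'Alice'},
#  {'customer_id': 2, 'amount': 20, 'name': 'Bob'}]
```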
Requirements
● Input sources must be replayable
– So that recent data can be re-read if the job crashes.
● Output sinks must support transactional updates
– So that the system can make a set of records appear atomically.
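Together, these two requirements make exactly-once output possible after a crash. A toy sketch of the mechanism (a replayable log plus a sink that records data and offset together; not Spark's actual recovery code):

```python
# Toy sketch of the two requirements: a replayable source (re-read
# from a recorded offset) and a transactional sink (a batch of rows
# plus its offset become visible atomically), which together give
# exactly-once output after a crash. Conceptual, not Spark code.

source = ["a", "b", "c", "d"]  # replayable: an indexable log

class TransactionalSink:
    def __init__(self):
        self.rows = []
        self.committed_offset = 0

    def commit(self, batch, new_offset):
        # Rows and offset are recorded together ("atomically").
        self.rows.extend(batch)
        self.committed_offset = new_offset

def run_trigger(sink, crash_before_commit=False):
    # Always resume from the last committed offset.
    offset = sink.committed_offset
    batch = source[offset:offset + 2]
    if crash_before_commit:
        return  # crash: nothing was committed, so no effect
    sink.commit(batch, offset + len(batch))

sink = TransactionalSink()
run_trigger(sink)                           # writes ["a", "b"]
run_trigger(sink, crash_before_commit=True) # crash: nothing written
run_trigger(sink)                           # replays ["c", "d"] exactly once
print(sink.rows)  # ['a', 'b', 'c', 'd']
```

After the simulated crash, the batch is simply re-read from the source and written once, with no duplicates and no gaps.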
Comparison with Other Engines
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Demo 2
Questions ??
References
● https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html
● https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
● https://www.youtube.com/watch?v=ZFBgY0PwUeY
● http://spark.apache.org/docs/latest/
Thank You !!!