Structured Streaming Using Spark 2.1
BY YASWANTH YADLAPALLI
What is Structured Streaming?
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
Express streaming computation the same way as a batch computation on static data.
The Spark SQL engine takes care of running it incrementally and continuously and updating the final result as streaming data continues to arrive.
Use the Dataset/DataFrame API to express streaming aggregations, event-time windows, stream-to-batch joins, etc.
Structured Streaming is still ALPHA in Spark 2.1, and its APIs are still experimental.
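As a sketch of this model, the snippet below expresses a streaming word count exactly like a batch query. The socket source on localhost:9999 is an assumption for illustration; any line-oriented streaming source works the same way.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("StructuredWordCount")
  .getOrCreate()
import spark.implicits._

// `lines` is an unbounded DataFrame; new socket lines become new rows
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The query below reads exactly like batch code on a static DataFrame;
// the engine runs it incrementally as data arrives
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()
```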
Programming Model
Output modes
There are three output modes in Spark Structured Streaming:
Append mode - Only the new rows added to the Result Table since the last trigger are written to the sink. This is supported only for queries where rows added to the Result Table will never change.
Complete mode - The whole Result Table is written to the sink after every trigger. This is supported for aggregation queries.
Update mode - Only the rows in the Result Table that were updated since the last trigger are written to the sink.
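A minimal sketch of selecting an output mode when starting a query, assuming `wordCounts` is a streaming aggregation DataFrame like the one above and the console sink is used for illustration:

```scala
// The output mode is chosen per query on the DataStreamWriter.
// "complete" fits aggregations; "append" / "update" are subject
// to the restrictions described above.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```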
Windowing
import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
Delayed events
Watermarking
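A sketch of adding a watermark to the windowed count above, assuming the same `words` streaming DataFrame. The 10-minute threshold is an example value:

```scala
import org.apache.spark.sql.functions.window

// withWatermark lets the engine drop state for windows older than
// (max event time seen - 10 minutes), bounding how long late events
// are accepted and keeping the query's state from growing forever
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()
```

Events arriving more than 10 minutes behind the latest event time seen so far may be dropped rather than updating their (now finalized) windows.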
Watermarking and output mode