Structured Streaming Using Spark 2.1
BY YASWANTH YADLAPALLI
What is Structured Streaming?
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
Express streaming computation the same way as a batch computation on static data.
The Spark SQL engine takes care of running it incrementally and continuously and updating the final result as streaming data continues to arrive.
Use the Dataset/DataFrame API to express streaming aggregations, event-time windows, stream-to-batch joins, etc.
Structured Streaming is still ALPHA in Spark 2.1, and its APIs are still experimental.
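As a sketch of this model, the snippet below expresses a streaming word count exactly like a batch query. The socket source on localhost:9999 is an assumption for illustration; any line-oriented streaming source works the same way.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("StructuredWordCount")
  .getOrCreate()
import spark.implicits._

// `lines` is an unbounded DataFrame; new socket lines become new rows
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The query below reads exactly like batch code on a static DataFrame;
// the engine runs it incrementally as data arrives
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()
```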
Programming Model
Output modes
There are three output modes in Spark Structured Streaming:
Append mode - Only the new rows added to the Result Table since the last trigger are written to the sink. This is supported only for queries where rows added to the Result Table will never change.
Complete mode - The whole Result Table is written to the sink after every trigger. This is supported for aggregation queries.
Update mode - Only the rows in the Result Table that were updated since the last trigger are written to the sink.
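A minimal sketch of selecting an output mode when starting a query, assuming `wordCounts` is a streaming aggregation DataFrame like the one above and the console sink is used for illustration:

```scala
// The output mode is chosen per query on the DataStreamWriter.
// "complete" fits aggregations; "append" / "update" are subject
// to the restrictions described above.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```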
Windowing
import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
Delayed events
Watermarking
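A sketch of adding a watermark to the windowed count above, assuming the same `words` streaming DataFrame. The 10-minute threshold is an example value:

```scala
import org.apache.spark.sql.functions.window

// withWatermark lets the engine drop state for windows older than
// (max event time seen - 10 minutes), bounding how long late events
// are accepted and keeping the query's state from growing forever
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()
```

Events arriving more than 10 minutes behind the latest event time seen so far may be dropped rather than updating their (now finalized) windows.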
Watermarking and output mode