
CS535 Big Data 2/12/2020 Week 4-B Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1

CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

FAQs
• Submission deadline for the GEAR Session 1 review: Feb 25
• Presenters: please upload (Canvas) your slides at least 2 hours before the presentation session

CS535 Big Data | Computer Science | Colorado State University


Topics of Today's Class
• 3. Distributed Computing Models for Scalable Batch Computing
  • DataFrame
  • Spark SQL
  • Datasets
• 4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron
  • Apache Storm model
  • Parallelism
  • Grouping methods


In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets



What is Spark SQL?
• Spark module for structured data processing
• Interfaces provided by Spark: SQL and the Dataset API
• Spark SQL can be used to execute SQL queries
  • Available from the command line or over JDBC/ODBC


What are Datasets?
• A Dataset is a distributed collection of data
• New interface added in Spark (since v1.6) that provides
  • the benefits of RDDs (strong typing, ability to use lambda functions)
  • the benefits of Spark SQL's optimized execution engine
• Available in Scala and Java
  • Python does not support the Dataset API



What is a DataFrame?
• A DataFrame is a Dataset organized into named columns
  • Like a table in a relational database or a data frame in R/Python
  • Enables richer optimizations under the hood
• Available in Scala, Java, Python, and R


In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets

Getting Started



Create a SparkSession: Starting Point
• SparkSession
  • The entry point into all functionality in Spark

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

Find the full example code in the Spark repo:
examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala


Creating DataFrames
• With a SparkSession, applications can create DataFrames from
  • an existing RDD
  • a Hive table
  • Spark data sources

val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+



Untyped Dataset Operations (a.k.a. DataFrame Operations)
• In the Scala and Java APIs, DataFrames are just Datasets of Rows
  • Their operations are untyped transformations
• What are "typed operations"?
  • Transformations on strongly typed Scala/Java Datasets

// This import is needed to use the $-notation
import spark.implicits._

// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)


Untyped Dataset Operations (a.k.a. DataFrame Operations)

// Select only the "name" column
df.select("name").show()
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+



Untyped Dataset Operations (a.k.a. DataFrame Operations)

// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// |   name|(age + 1)|
// +-------+---------+
// |Michael|     null|
// |   Andy|       31|
// | Justin|       20|
// +-------+---------+


Untyped Dataset Operations

// Select people older than 21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+



Running SQL Queries
• SELECT * FROM people

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+


Global Temporary View
• Temporary views in Spark SQL
  • Session-scoped
  • Disappear when the session that created them terminates
• Global temporary view
  • Shared among all sessions and kept alive until the Spark application terminates
  • Tied to a system-preserved database (global_temp)



Global Temporary View
• SELECT * FROM people

// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// Global temporary views are tied to a system-preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+


Global Temporary View

// Global temporary views are cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+



Creating Datasets
• Datasets are similar to RDDs
  • Serialize objects with an Encoder rather than standard Java or Kryo serialization
  • Many Spark Dataset operations can be performed without deserializing objects

case class Person(name: String, age: Long)

// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+
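The claim that many operations can run on serialized data without materializing objects can be illustrated with a plain-Java sketch. This is only a conceptual analogy, not Spark's actual Encoder or Tungsten format: the fixed binary layout and helper names here are invented for illustration. A filter on age reads only the bytes it needs; the name field is never deserialized.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class EncoderSketch {
    // Hypothetical fixed layout: 8-byte age, then 4-byte name length, then name bytes
    static byte[] encode(String name, long age) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(8 + 4 + nameBytes.length);
        buf.putLong(age).putInt(nameBytes.length).put(nameBytes);
        return buf.array();
    }

    // Reads only the age field from the serialized record
    static long ageOf(byte[] record) {
        return ByteBuffer.wrap(record).getLong();
    }

    public static void main(String[] args) {
        List<byte[]> records = new ArrayList<>();
        records.add(encode("Michael", 29));
        records.add(encode("Andy", 30));
        records.add(encode("Justin", 19));

        // Filter on age without deserializing the full records
        long adults = records.stream().filter(r -> ageOf(r) > 21).count();
        System.out.println(adults); // prints 2
    }
}
```

Spark's real format additionally packs rows off-heap and is generated per schema by the Encoder, but the payoff is the same: predicates touch only the relevant bytes.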


Creating Datasets

// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// DataFrames can be converted to a Dataset by providing a class.
// Mapping will be done by name
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+



In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets

Interoperating with RDDs


Interoperating with RDDs
• Converting RDDs into Datasets
  • Case 1: Using reflection to infer the schema of an RDD
  • Case 2: Using a programmatic interface to construct a schema and then apply it to an existing RDD



Interoperating with RDDs: 1. Using Reflection
• Automatically converts an RDD (containing case classes) to a DataFrame
  • The case class defines the schema of the table
  • E.g. the names of the arguments to the case class are read using reflection and become the names of the columns
• Case classes can also be nested or contain complex types such as Seqs or Arrays
• The RDD is implicitly converted to a DataFrame and can then be registered as a table
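The core of the reflection idea can be mimicked in plain Java: the declared fields of a class supply the column names. This is a conceptual sketch only, not Spark's implementation; the Person class and helper method are invented for illustration.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class SchemaByReflection {
    // Plays the role of a Scala case class that defines the schema
    static class Person {
        String name;
        int age;
    }

    // Derive column names from the class's declared fields, as
    // reflection-based schema inference does conceptually
    static List<String> columnNames(Class<?> cls) {
        List<String> cols = new ArrayList<>();
        for (Field f : cls.getDeclaredFields()) {
            cols.add(f.getName());
        }
        return cols;
    }

    public static void main(String[] args) {
        // Typically yields the fields in declaration order, e.g. [name, age]
        System.out.println(columnNames(Person.class));
    }
}
```

Spark additionally reads each field's static type to pick a column type (String, Long, nested struct, and so on), which is why nested case classes and complex types also work.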


Interoperating with RDDs: 1. Using Reflection

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()

// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql method provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")



Interoperating with RDDs: 2. Specifying the Schema Programmatically
• When case classes cannot be defined ahead of time
  • Step 1: Create an RDD of Rows from the original RDD
  • Step 2: Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in Step 1
  • Step 3: Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession


Interoperating with RDDs: 2. Specifying the Schema Programmatically

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (peopleRDD) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))



Interoperating with RDDs: 2. Specifying the Schema Programmatically

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+


Datasets and Type Safety
• Datasets are composed of typed objects
  • Syntax errors in transformations (e.g. a typo in a method name) and analysis errors (e.g. an incorrect input variable type) can be caught at compile time
• DataFrames are composed of untyped Row objects
  • Only syntax errors can be caught at compile time
• Spark SQL queries are plain strings
  • Syntax errors and analysis errors are only caught at runtime
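The typed-versus-untyped distinction can be illustrated with a plain-Java analogy (this assumes nothing about Spark's internals; the Person class and the string-keyed row map are invented for illustration): a typed accessor is checked by the compiler, while a string-keyed lookup with a typo compiles fine and only misbehaves at runtime.

```java
import java.util.HashMap;
import java.util.Map;

public class TypeSafetySketch {
    // Typed: a misspelled accessor such as p.getNmae() would not compile
    static class Person {
        private final String name;
        Person(String name) { this.name = name; }
        String getName() { return name; }
    }

    public static void main(String[] args) {
        Person p = new Person("Andy");
        System.out.println(p.getName()); // checked at compile time

        // Untyped: the typo "nmae" compiles fine; the error only
        // surfaces at runtime, as a null that goes unnoticed until used
        Map<String, Object> row = new HashMap<>();
        row.put("name", "Andy");
        Object v = row.get("nmae");
        System.out.println(v); // prints null
    }
}
```

Spark SQL strings sit one step further along the same axis: even the query's grammar is only validated when the string is parsed at runtime.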



CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY
4. REAL-TIME STREAMING COMPUTING MODELS: APACHE STORM AND TWITTER HERON

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

This material is built based on
• Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. 2014. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 147-156. DOI: https://doi.org/10.1145/2588555.2595641
• Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 239-250. DOI: https://doi.org/10.1145/2723372.2742788
• Storm programming guide: http://storm.apache.org/releases/2.0.0-SNAPSHOT/index.html



4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron

Apache Storm Model


Storm Model
• One-at-a-time stream processing
• Represents the entire stream processing pipeline as a graph of computation called a topology
  • A single program is deployed across a cluster
• A stream is represented as an infinite sequence of tuples
  • A tuple: a named list of values

[Figure: a data stream shown as an unbounded sequence of tuples]
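Since a tuple is just a named list of values, it can be sketched in a few lines of Java. This is loosely modeled on Storm's Tuple interface; the MiniTuple class and its accessor are invented for illustration, and the real class offers many more typed accessors.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MiniTuple {
    private final Map<String, Object> values = new LinkedHashMap<>();

    // Pair each field name with its value, preserving declaration order
    MiniTuple(List<String> fields, List<Object> vals) {
        for (int i = 0; i < fields.size(); i++) {
            values.put(fields.get(i), vals.get(i));
        }
    }

    // Look a value up by its field name
    Object getValueByField(String field) {
        return values.get(field);
    }

    public static void main(String[] args) {
        MiniTuple t = new MiniTuple(List.of("sentence"),
                                    List.of("my dog has fleas"));
        System.out.println(t.getValueByField("sentence")); // prints my dog has fleas
    }
}
```

The field names are declared once per stream (see declareOutputFields later in these slides), so downstream bolts can address values by name rather than by position.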



Why use Storm?
• Distributed real-time computation system
  • Real-time analytics
  • Online machine learning
  • Continuous computation
  • Distributed RPC
  • ETL
• Use cases
  • Twitter
  • Amazon Kinesis and Storm
    • KinesisSpoutConfig
    • KinesisSpout


Spout in the Storm Model
• Spout
  • A source of streams in a topology
  • A spout can read from a Kestrel or Kafka queue
  • Turns the data into a stream of tuples

[Figure: a spout producing one or more tuple streams]



Bolt in the Storm Model
• Bolt
  • Performs actions on streams
  • Takes any number of streams as input and produces any number of streams as output
  • Runs functions, filters data, computes aggregations, does streaming joins, updates databases, etc.

[Figure: a bolt consuming input tuple streams and emitting output tuple streams]


Topology in the Storm Model
• Topology
  • A network of spouts and bolts, with each edge representing a bolt that processes the output stream of another spout or bolt
• Task
  • Each instance of a spout or bolt



4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron

Apache Storm
Word Count Example


Word count topology: Sentence Spout
• Sentence spout
  • Continuously emits a stream of single-value tuples with the field name "sentence" and a string value:
    {"sentence": "my dog has fleas"}

[Figure: Sentence Spout → Split Sentence Bolt → Word Count Bolt → Report Bolt]



Word count topology: Split Sentence
• Split Sentence Bolt
  • Subscribes to the sentence spout's tuple stream and emits one tuple per word:
    {"word": "my"} {"word": "dog"} {"word": "has"} {"word": "fleas"}

[Figure: Sentence Spout → Split Sentence Bolt → Word Count Bolt → Report Bolt]


Word count topology: Word Count
• Word count bolt
  • Subscribes to the output of the SplitSentenceBolt class
  • Keeps a count of how many times it has seen each particular word
  • Whenever it receives a tuple, it increments the counter and emits, e.g.:
    {"word": "dog", "count": 5}

[Figure: Sentence Spout → Split Sentence Bolt → Word Count Bolt → Report Bolt]



Word count topology: Report
• Report bolt
  • Subscribes to the output of the WordCountBolt class
  • Maintains a table of word counts
  • Whenever it receives a tuple, it updates the table and prints the contents to the console

[Figure: Sentence Spout → Split Sentence Bolt → Word Count Bolt → Report Bolt]


SentenceSpout.java

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private String[] sentences = {
        "my dog has fleas",
        "i like cold beverages",
        "the dog ate my homework",
        "don't have a truck",
        "i don't think i like fleas"
    };
    private int index = 0;

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }



SentenceSpout.java: Continued

    public void open(Map config, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void nextTuple() {
        this.collector.emit(new Values(sentences[index]));
        index++;
        if (index >= sentences.length) {
            index = 0;
        }
        Utils.waitForMillis(1);
    }
}


SplitSentenceBolt.java

public class SplitSentenceBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map config, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple tuple) {
        String sentence = tuple.getStringByField("sentence");
        String[] words = sentence.split(" ");
        for (String word : words) {
            this.collector.emit(new Values(word));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}



WordCountBolt.java

public class WordCountBolt extends BaseRichBolt {
    private OutputCollector collector;
    private HashMap<String, Long> counts = null;

    public void prepare(Map config, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
        this.counts = new HashMap<String, Long>();
    }

    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Long count = this.counts.get(word);
        if (count == null) { count = 0L; }
        count++;
        this.counts.put(word, count);
        this.collector.emit(new Values(word, count));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}


ReportBolt.java

public class ReportBolt extends BaseRichBolt {
    private HashMap<String, Long> counts = null;

    public void prepare(Map config, TopologyContext context,
                        OutputCollector collector) {
        this.counts = new HashMap<String, Long>();
    }

    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Long count = tuple.getLongByField("count");
        this.counts.put(word, count);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) { /* no output */ }

    public void cleanup() {
        System.out.println("--- FINAL COUNTS ---");
        List<String> keys = new ArrayList<String>();
        keys.addAll(this.counts.keySet());
        Collections.sort(keys);
        for (String key : keys) {
            System.out.println(key + " : " + this.counts.get(key));
        }
    }
}



WordCountTopology.java

public class WordCountTopology {
    private static final String SENTENCE_SPOUT_ID = "sentence-spout";
    private static final String SPLIT_BOLT_ID = "split-bolt";
    private static final String COUNT_BOLT_ID = "count-bolt";
    private static final String REPORT_BOLT_ID = "report-bolt";
    private static final String TOPOLOGY_NAME = "word-count-topology";

    public static void main(String[] args) throws Exception {
        SentenceSpout spout = new SentenceSpout();
        SplitSentenceBolt splitBolt = new SplitSentenceBolt();
        WordCountBolt countBolt = new WordCountBolt();
        ReportBolt reportBolt = new ReportBolt();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout(SENTENCE_SPOUT_ID, spout);
        // SentenceSpout --> SplitSentenceBolt
        builder.setBolt(SPLIT_BOLT_ID, splitBolt)
               .shuffleGrouping(SENTENCE_SPOUT_ID);


WordCountTopology.java: Continued

        // SplitSentenceBolt --> WordCountBolt
        builder.setBolt(COUNT_BOLT_ID, countBolt)
               .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
        // WordCountBolt --> ReportBolt
        builder.setBolt(REPORT_BOLT_ID, reportBolt)
               .globalGrouping(COUNT_BOLT_ID);

        Config config = new Config();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology(TOPOLOGY_NAME, config, builder.createTopology());
        waitForSeconds(10);
        cluster.killTopology(TOPOLOGY_NAME);
        cluster.shutdown();
    }
}
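The fieldsGrouping on "word" matters for correctness: every occurrence of the same word must reach the same WordCountBolt task, or its count would be split across tasks. The underlying routing idea can be sketched as hashing the grouping field and taking it modulo the task count (a simplification; Storm's actual routing differs in detail):

```java
public class FieldsGroupingSketch {
    // Route a tuple to a task index based on the hash of its grouping field
    static int taskFor(String fieldValue, int numTasks) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        // The same word always maps to the same task,
        // so its running count lives in exactly one place
        System.out.println(taskFor("dog", numTasks) == taskFor("dog", numTasks)); // prints true
        System.out.println(taskFor("dog", numTasks));
        System.out.println(taskFor("fleas", numTasks));
    }
}
```

By contrast, shuffleGrouping (used between the spout and the split bolt, where no per-key state exists) distributes tuples randomly, and globalGrouping sends everything to a single ReportBolt task so one table holds all counts.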



Results
--- FINAL COUNTS ---
a : 1426
ate : 1426
beverages : 1426
cold : 1426
cow : 1426
dog : 2852
don't : 2851
fleas : 2851
has : 1426
have : 1426
homework : 1426
i : 4276
like : 2851
man : 1426
my : 2852
the : 1426
think : 1425
--------------


4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron

Apache Storm

Parallelism in Storm



Components of the Storm cluster
• Nodes (machines)
  • Execute portions of a topology
• Workers (JVMs)
  • Independent JVM processes running on a node
  • Each node is configured to run one or more workers
  • A topology may request one or more workers to be assigned to it
• Executors (threads)
  • Java threads running within a worker JVM process
  • Multiple tasks can be assigned to a single executor
  • Unless explicitly overridden, Storm assigns one task to each executor
• Tasks (bolt/spout instances)
  • Instances of spouts and bolts whose nextTuple() and execute() methods are called by executor threads


Parallelism in the WordCount topology
• In our example, we have NOT used any of Storm's parallelism settings
  • The default parallelism factor is one
• Topology execution flow:

[Figure: one worker JVM on a node with four executor threads, each running one task: Sentence Spout, Split Sentence Bolt, Word Count Bolt, Report Bolt]



Adding workers to a topology
• Through configuration
• Through APIs
  • Passing a Config object to the submitTopology() method
  • Bolts and spouts do not have to change

Config config = new Config();
config.setNumWorkers(2);


Adding executors and tasks
• Specify the number of executors when defining a stream grouping
  • builder.setSpout(SENTENCE_SPOUT_ID, spout, 2);
  • Assigns two tasks, and each task is assigned its own executor thread
  • 2 executors on a single worker



Two spout tasks (if we are using one worker)
• (Diagram) One worker (JVM) on a node now runs five executor threads: two for the Sentence Spout tasks (one task per executor), and one each for the Split Sentence Bolt, Word Count Bolt, and Report Bolt


builder.setBolt(SPLIT_BOLT_ID, splitBolt, 2)
       .setNumTasks(4)
       .shuffleGrouping(SENTENCE_SPOUT_ID);

In SplitSentenceBolt and WordCountBolt:
• Set up the split sentence bolt to execute as 4 tasks and 2 executors (the parallelism hint)
• Storm will run 2 tasks per executor (thread): each executor thread is assigned two tasks to execute
• 2 executors to a single worker; total tasks are 4
• How many workers will work for this example?


• Answer: 2 workers (JVMs), with 2 executors per worker
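The arithmetic behind this answer (4 tasks over 2 executors over 2 workers) can be sketched in plain Java. This is only an illustration of the round-robin idea; Storm's actual scheduler is pluggable and more involved, and the class and method names here are made up:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParallelismSketch {
    // Round-robin "items" (tasks or executors) into "bins" (executors or workers).
    static Map<Integer, List<Integer>> roundRobin(int items, int bins) {
        Map<Integer, List<Integer>> assignment = new HashMap<>();
        for (int b = 0; b < bins; b++) assignment.put(b, new ArrayList<>());
        for (int i = 0; i < items; i++) assignment.get(i % bins).add(i);
        return assignment;
    }

    public static void main(String[] args) {
        // setNumWorkers(2), parallelism hint 2, setNumTasks(4)
        Map<Integer, List<Integer>> tasksPerExecutor = roundRobin(4, 2);
        Map<Integer, List<Integer>> executorsPerWorker = roundRobin(2, 2);
        System.out.println("tasks per executor:   " + tasksPerExecutor);
        System.out.println("executors per worker: " + executorsPerWorker);
    }
}
```

With these numbers, each executor ends up with 2 tasks and each worker with 1 executor of this bolt (plus executors of the other components).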


4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron

Apache Storm

Stream Groupings


Stream groupings
• How a stream’s tuples are distributed among bolt tasks in a topology
• E.g., the SplitSentenceBolt class was assigned four tasks in the topology
  • Which tuples will be processed by which task?
  • The stream grouping determines which one of those tasks will receive a given tuple


Seven built-in stream groupings (1/3)
• Shuffle grouping
  • Randomly distributes tuples across the target bolt’s tasks
• Fields grouping
  • Routes tuples to bolt tasks based on the values of the fields specified in the grouping
  • E.g., grouped on the “word” field: tuples with the same value for the “word” field will always be routed to the same bolt task
• All grouping
  • Replicates the tuple stream across all bolt tasks
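The fields-grouping invariant (equal field values always reach the same task) is typically achieved by hashing the grouping fields modulo the task count. The sketch below illustrates that idea in plain Java; Storm's actual hash function may differ, and the class and method names are assumptions for illustration:

```java
import java.util.List;
import java.util.Objects;

public class FieldsGroupingSketch {
    // Hash the grouping-field values and take the result modulo the number
    // of target tasks. Equal field values always map to the same task index.
    static int chooseTask(List<Object> fieldValues, int numTasks) {
        return Math.floorMod(Objects.hash(fieldValues.toArray()), numTasks);
    }

    public static void main(String[] args) {
        int t1 = chooseTask(List.of("dog"), 4);
        int t2 = chooseTask(List.of("dog"), 4);
        System.out.println("'dog' routes to the same task both times: " + (t1 == t2));
    }
}
```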


Seven built-in stream groupings (2/3)
• Global grouping
  • Routes all tuples in a stream to a single task
  • Chooses the task with the lowest task ID value
• None grouping
  • Functionally equivalent to the shuffle grouping
  • Reserved for future use
• Direct grouping
  • The source stream decides which component will receive a given tuple, by calling the emitDirect() method
  • Only for streams that have been declared as direct streams


Seven built-in stream groupings (3/3)
• Local or shuffle grouping
  • Shuffles tuples among bolt tasks running in the same worker process, if any
  • Otherwise, performs a shuffle grouping
  • Depending on the parallelism of a topology, the local or shuffle grouping can increase topology performance by limiting network transfer
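The "local if possible, otherwise shuffle" decision can be sketched in a few lines of plain Java (an illustration, not Storm's implementation; the names are made up):

```java
import java.util.List;
import java.util.Random;

public class LocalOrShuffleSketch {
    // If the target bolt has tasks in the sender's own worker process,
    // shuffle among those (no network transfer); otherwise fall back to
    // a plain shuffle across all target tasks.
    static int chooseTask(List<Integer> localTasks, List<Integer> allTasks, Random rnd) {
        List<Integer> pool = localTasks.isEmpty() ? allTasks : localTasks;
        return pool.get(rnd.nextInt(pool.size()));
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        // Tasks 2 and 3 are co-located with the sender; tasks 0..3 exist overall.
        int chosen = chooseTask(List.of(2, 3), List.of(0, 1, 2, 3), rnd);
        System.out.println("chose co-located task " + chosen);
    }
}
```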


Custom stream grouping (the CustomStreamGrouping interface)

public interface CustomStreamGrouping extends Serializable {
    void prepare(WorkerTopologyContext context,
                 GlobalStreamId stream,
                 List<Integer> targetTasks);

    List<Integer> chooseTasks(int taskId, List<Object> values);
}
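The core of a custom grouping is chooseTasks(): given a tuple's values and the candidate target tasks (normally captured in prepare()), return the task(s) that should receive the tuple. The sketch below mimics that logic in plain Java, omitting Storm's WorkerTopologyContext and GlobalStreamId types so it compiles standalone; the class name and routing rule are illustrative assumptions:

```java
import java.util.List;

public class CustomGroupingSketch {
    // Stand-in for CustomStreamGrouping.chooseTasks(): route each tuple to
    // one target task, chosen deterministically from its first value.
    static List<Integer> chooseTasks(List<Integer> targetTasks, List<Object> values) {
        int idx = Math.floorMod(values.get(0).hashCode(), targetTasks.size());
        return List.of(targetTasks.get(idx));
    }

    public static void main(String[] args) {
        List<Integer> targets = List.of(7, 8, 9); // task IDs assigned by Storm
        System.out.println(chooseTasks(targets, List.of("my dog has fleas")));
    }
}
```

Any deterministic rule works here; returning more than one task ID sends the tuple to each of them.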


Example of grouping (1/2)
• nextTuple() method of SentenceSpout

public void nextTuple() {
    if (index < sentences.length) {
        this.collector.emit(new Values(sentences[index]));
        index++;
    }
    Utils.waitForMillis(1);
}

• Output with the original (fields-grouped) topology:

--- FINAL COUNTS ---
a : 2    ate : 2    beverages : 2    cold : 2    cow : 2
dog : 4    don't : 4    fleas : 4    has : 2    have : 2
homework : 2    i : 6    like : 4    man : 2    my : 4
the : 2    think : 2
--------------


Example of grouping (2/2)
• Now change the grouping on the count bolt to a shuffle grouping and rerun the topology:

builder.setBolt(COUNT_BOLT_ID, countBolt, 4).shuffleGrouping(SPLIT_BOLT_ID);

• New output (shuffle grouping):

--- FINAL COUNTS ---
a : 1    ate : 2    beverages : 1    cold : 1    cow : 1
dog : 2    don't : 2    fleas : 1    has : 1    have : 1
homework : 1    i : 3    like : 1    man : 1    my : 1
the : 1    think : 1
--------------

WHY?

• Original output (fields grouping):

--- FINAL COUNTS ---
a : 2    ate : 2    beverages : 2    cold : 2    cow : 2
dog : 4    don't : 4    fleas : 4    has : 2    have : 2
homework : 2    i : 6    like : 4    man : 2    my : 4
the : 2    think : 2
--------------


• The original topology definition (fields grouping on the count bolt):

builder.setBolt(SPLIT_BOLT_ID, splitBolt)
       .shuffleGrouping(SENTENCE_SPOUT_ID);
builder.setBolt(COUNT_BOLT_ID, countBolt)
       .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
builder.setBolt(REPORT_BOLT_ID, reportBolt)
       .globalGrouping(COUNT_BOLT_ID);

Why?
• The count bolt is stateful: it maintains a count for each word it has seen
• With a shuffle grouping, occurrences of the same word are spread across tasks, so no single task captures the total count of appearances of a word
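The effect can be reproduced without Storm by simulating four stateful counter tasks under each routing rule. This is a sketch: round-robin stands in for shuffle's random choice, the class names are made up, and the input stream is a toy word list:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupingCountSim {
    // Simulate 4 stateful count tasks, each keeping its own word->count map.
    static List<Map<String, Integer>> run(List<String> words, boolean fieldsGrouping) {
        int numTasks = 4;
        List<Map<String, Integer>> tasks = new java.util.ArrayList<>();
        for (int i = 0; i < numTasks; i++) tasks.add(new HashMap<>());
        int rr = 0; // round-robin stands in for shuffle's random routing
        for (String w : words) {
            int task = fieldsGrouping ? Math.floorMod(w.hashCode(), numTasks)
                                      : (rr++ % numTasks);
            tasks.get(task).merge(w, 1, Integer::sum);
        }
        return tasks;
    }

    // Largest count any single task holds for the given word.
    static int maxCount(List<Map<String, Integer>> tasks, String w) {
        return tasks.stream().mapToInt(t -> t.getOrDefault(w, 0)).max().orElse(0);
    }

    public static void main(String[] args) {
        List<String> words = List.of("dog", "dog", "dog", "dog");
        // Fields grouping: one task sees every "dog", so it holds the true total (4).
        System.out.println("fields:  " + maxCount(run(words, true), "dog"));
        // Shuffle: the four "dog" tuples are spread across tasks, so no task holds the total.
        System.out.println("shuffle: " + maxCount(run(words, false), "dog"));
    }
}
```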


Questions?
