
CS535 Big Data 2/12/2020 Week 4-B Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1

CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

FAQs
• Submission deadline for the GEAR Session 1 review: Feb 25
• Presenters: please upload (Canvas) your slides at least 2 hours before the presentation session

CS535 Big Data | Computer Science | Colorado State University


Topics of Today's Class
• 3. Distributed Computing Models for Scalable Batch Computing
  • DataFrame
  • Spark SQL
  • Datasets
• 4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron
  • Apache Storm model
  • Parallelism
  • Grouping methods


In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets



What is Spark SQL?
• Spark module for structured data processing
• Interfaces provided by Spark: SQL and the Dataset API
• Spark SQL can be used to execute SQL queries
  • Available from the command line or over JDBC/ODBC


What are Datasets?
• A Dataset is a distributed collection of data
• New interface added in Spark (since v1.6) that provides
  • the benefits of RDDs (strong typing, ability to use lambda functions)
  • the benefits of Spark SQL's optimized execution engine
• Available in Scala and Java
  • Python does not support the Dataset API



What is a DataFrame?
• A DataFrame is a Dataset organized into named columns
  • Like a table in a relational database or a data frame in R/Python
  • Enables richer optimizations under the hood
• Available in Scala, Java, Python, and R


In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets

Getting Started



Create a SparkSession: Starting Point
• SparkSession
  • The entry point into all functionality in Spark

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

Find the full example code in the Spark repo:
examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala


Creating DataFrames
• With a SparkSession, applications can create DataFrames from
  • an existing RDD
  • a Hive table
  • Spark data sources

val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+



Untyped Dataset Operations (a.k.a. DataFrame Operations)
• In the Scala and Java APIs, DataFrames are just Datasets of Rows
  • Their operations are untyped transformations
• What are "typed operations"?
  • Transformations on strongly typed Scala/Java Datasets

// This import is needed to use the $-notation
import spark.implicits._

// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)


Untyped Dataset Operations (a.k.a. DataFrame Operations)

// Select only the "name" column
df.select("name").show()
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+



Untyped Dataset Operations (a.k.a. DataFrame Operations)

// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// |   name|(age + 1)|
// +-------+---------+
// |Michael|     null|
// |   Andy|       31|
// | Justin|       20|
// +-------+---------+


Untyped Dataset Operations

// Select people older than 21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+



Running SQL Queries
• SELECT * FROM people

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+


Global Temporary View
• Temporary views in Spark SQL
  • Session-scoped
  • Disappear when the session that created them terminates
• Global temporary view
  • Shared among all sessions and kept alive until the Spark application terminates
  • Tied to a system-preserved database (global_temp)



Global Temporary View
• SELECT * FROM people

// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// Global temporary views are tied to a system-preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+


Global Temporary View

// Global temporary views are cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+



Creating Datasets
• Datasets are similar to RDDs
  • Serialize objects with an Encoder rather than standard Java or Kryo serialization
  • Many Spark Dataset operations can be performed without deserializing objects

case class Person(name: String, age: Long)

// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+
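The claim that many operations can run on serialized data without materializing objects can be illustrated with a plain-Java sketch. This is only a conceptual analogy, not Spark's actual Encoder or Tungsten format: the fixed binary layout and helper names here are invented for illustration. A filter on age reads only the bytes it needs; the name field is never deserialized.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class EncoderSketch {
    // Hypothetical fixed layout: 8-byte age, then 4-byte name length, then name bytes
    static byte[] encode(String name, long age) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(8 + 4 + nameBytes.length);
        buf.putLong(age).putInt(nameBytes.length).put(nameBytes);
        return buf.array();
    }

    // Reads only the age field from the serialized record
    static long ageOf(byte[] record) {
        return ByteBuffer.wrap(record).getLong();
    }

    public static void main(String[] args) {
        List<byte[]> records = new ArrayList<>();
        records.add(encode("Michael", 29));
        records.add(encode("Andy", 30));
        records.add(encode("Justin", 19));

        // Filter on age without deserializing the full records
        long adults = records.stream().filter(r -> ageOf(r) > 21).count();
        System.out.println(adults); // prints 2
    }
}
```

Spark's real format additionally packs rows off-heap and is generated per schema by the Encoder, but the payoff is the same: predicates touch only the relevant bytes.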


Creating Datasets

// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// DataFrames can be converted to a Dataset by providing a class.
// Mapping will be done by name
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+



In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets

Interoperating with RDDs


Interoperating with RDDs
• Converting RDDs into Datasets
  • Case 1: Using reflection to infer the schema of an RDD
  • Case 2: Using a programmatic interface to construct a schema and then apply it to an existing RDD



Interoperating with RDDs: 1. Using Reflection
• Automatically converts an RDD (containing case classes) to a DataFrame
  • The case class defines the schema of the table
  • E.g. the names of the arguments to the case class are read using reflection and become the names of the columns
• Case classes can also be nested or contain complex types such as Seqs or Arrays
• The RDD is implicitly converted to a DataFrame and can then be registered as a table
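The core of the reflection idea can be mimicked in plain Java: the declared fields of a class supply the column names. This is a conceptual sketch only, not Spark's implementation; the Person class and helper method are invented for illustration.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class SchemaByReflection {
    // Plays the role of a Scala case class that defines the schema
    static class Person {
        String name;
        int age;
    }

    // Derive column names from the class's declared fields, as
    // reflection-based schema inference does conceptually
    static List<String> columnNames(Class<?> cls) {
        List<String> cols = new ArrayList<>();
        for (Field f : cls.getDeclaredFields()) {
            cols.add(f.getName());
        }
        return cols;
    }

    public static void main(String[] args) {
        // Typically yields the fields in declaration order, e.g. [name, age]
        System.out.println(columnNames(Person.class));
    }
}
```

Spark additionally reads each field's static type to pick a column type (String, Long, nested struct, and so on), which is why nested case classes and complex types also work.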


Interoperating with RDDs: 1. Using Reflection

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()

// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql method provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")



Interoperating with RDDs: 2. Specifying the Schema Programmatically
• When case classes cannot be defined ahead of time
  • Step 1: Create an RDD of Rows from the original RDD
  • Step 2: Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in Step 1
  • Step 3: Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession


Interoperating with RDDs: 2. Specifying the Schema Programmatically

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (peopleRDD) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))



Interoperating with RDDs: 2. Specifying the Schema Programmatically

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+


Datasets and Type Safety
• Datasets are composed of typed objects
  • Syntax errors in transformations (e.g. a typo in a method name) and analysis errors (e.g. an incorrect input variable type) can be caught at compile time
• DataFrames are composed of untyped Row objects
  • Only syntax errors can be caught at compile time
• Spark SQL queries are plain strings
  • Syntax errors and analysis errors are only caught at runtime
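The typed-versus-untyped distinction can be illustrated with a plain-Java analogy (this assumes nothing about Spark's internals; the Person class and the string-keyed row map are invented for illustration): a typed accessor is checked by the compiler, while a string-keyed lookup with a typo compiles fine and only misbehaves at runtime.

```java
import java.util.HashMap;
import java.util.Map;

public class TypeSafetySketch {
    // Typed: a misspelled accessor such as p.getNmae() would not compile
    static class Person {
        private final String name;
        Person(String name) { this.name = name; }
        String getName() { return name; }
    }

    public static void main(String[] args) {
        Person p = new Person("Andy");
        System.out.println(p.getName()); // checked at compile time

        // Untyped: the typo "nmae" compiles fine; the error only
        // surfaces at runtime, as a null that goes unnoticed until used
        Map<String, Object> row = new HashMap<>();
        row.put("name", "Andy");
        Object v = row.get("nmae");
        System.out.println(v); // prints null
    }
}
```

Spark SQL strings sit one step further along the same axis: even the query's grammar is only validated when the string is parsed at runtime.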



CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY
4. REAL-TIME STREAMING COMPUTING MODELS: APACHE STORM AND TWITTER HERON

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

This material is built based on
• Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. 2014. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 147-156. DOI: https://doi.org/10.1145/2588555.2595641
• Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 239-250. DOI: https://doi.org/10.1145/2723372.2742788
• Storm programming guide: http://storm.apache.org/releases/2.0.0-SNAPSHOT/index.html



4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron

Apache Storm Model


Storm Model
• One-at-a-time stream processing
• Represents the entire stream processing pipeline as a graph of computation called a topology
  • A single program is deployed across a cluster
• A stream is represented as an infinite sequence of tuples
  • A tuple: a named list of values

[Figure: a data stream shown as an unbounded sequence of tuples]
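Since a tuple is just a named list of values, it can be sketched in a few lines of Java. This is loosely modeled on Storm's Tuple interface; the MiniTuple class and its accessor are invented for illustration, and the real class offers many more typed accessors.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MiniTuple {
    private final Map<String, Object> values = new LinkedHashMap<>();

    // Pair each field name with its value, preserving declaration order
    MiniTuple(List<String> fields, List<Object> vals) {
        for (int i = 0; i < fields.size(); i++) {
            values.put(fields.get(i), vals.get(i));
        }
    }

    // Look a value up by its field name
    Object getValueByField(String field) {
        return values.get(field);
    }

    public static void main(String[] args) {
        MiniTuple t = new MiniTuple(List.of("sentence"),
                                    List.of("my dog has fleas"));
        System.out.println(t.getValueByField("sentence")); // prints my dog has fleas
    }
}
```

The field names are declared once per stream (see declareOutputFields later in these slides), so downstream bolts can address values by name rather than by position.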



Why use Storm?
• Distributed real-time computation system
  • Real-time analytics
  • Online machine learning
  • Continuous computation
  • Distributed RPC
  • ETL
• Use cases
  • Twitter
  • Amazon Kinesis and Storm
    • KinesisSpoutConfig
    • KinesisSpout


Spout in the Storm Model
• Spout
  • A source of streams in a topology
  • A spout can read from a Kestrel or Kafka queue
  • Turns the data into a stream of tuples

[Figure: a spout producing one or more tuple streams]



Bolt in the Storm Model
• Bolt
  • Performs actions on streams
  • Takes any number of streams as input and produces any number of streams as output
  • Runs functions, filters data, computes aggregations, does streaming joins, updates databases, etc.

[Figure: a bolt consuming input tuple streams and emitting output tuple streams]


Topology in the Storm Model
• Topology
  • A network of spouts and bolts, with each edge representing a bolt that processes the output stream of another spout or bolt
• Task
  • Each instance of a spout or bolt



4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron

Apache Storm
Word Count Example


Word count topology: Sentence Spout
• Sentence spout
  • Continuously emits a stream of single-value tuples with the field name "sentence" and a string value:
    {"sentence": "my dog has fleas"}

[Figure: Sentence Spout → Split Sentence Bolt → Word Count Bolt → Report Bolt]



Word count topology: Split Sentence
• Split Sentence Bolt
  • Subscribes to the sentence spout's tuple stream and emits one tuple per word:
    {"word": "my"} {"word": "dog"} {"word": "has"} {"word": "fleas"}

[Figure: Sentence Spout → Split Sentence Bolt → Word Count Bolt → Report Bolt]


Word count topology: Word Count
• Word count bolt
  • Subscribes to the output of the SplitSentenceBolt class
  • Keeps a count of how many times it has seen each particular word
  • Whenever it receives a tuple, it increments the counter and emits, e.g.:
    {"word": "dog", "count": 5}

[Figure: Sentence Spout → Split Sentence Bolt → Word Count Bolt → Report Bolt]



Word count topology: Report
• Report bolt
  • Subscribes to the output of the WordCountBolt class
  • Maintains a table of word counts
  • Whenever it receives a tuple, it updates the table and prints the contents to the console

[Figure: Sentence Spout → Split Sentence Bolt → Word Count Bolt → Report Bolt]


SentenceSpout.java

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private String[] sentences = {
        "my dog has fleas",
        "i like cold beverages",
        "the dog ate my homework",
        "don't have a truck",
        "i don't think i like fleas"
    };
    private int index = 0;

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }



SentenceSpout.java: Continued

    public void open(Map config, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void nextTuple() {
        this.collector.emit(new Values(sentences[index]));
        index++;
        if (index >= sentences.length) {
            index = 0;
        }
        Utils.waitForMillis(1);
    }
}


SplitSentenceBolt.java

public class SplitSentenceBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map config, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple tuple) {
        String sentence = tuple.getStringByField("sentence");
        String[] words = sentence.split(" ");
        for (String word : words) {
            this.collector.emit(new Values(word));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}



WordCountBolt.java

public class WordCountBolt extends BaseRichBolt {
    private OutputCollector collector;
    private HashMap<String, Long> counts = null;

    public void prepare(Map config, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
        this.counts = new HashMap<String, Long>();
    }

    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Long count = this.counts.get(word);
        if (count == null) { count = 0L; }
        count++;
        this.counts.put(word, count);
        this.collector.emit(new Values(word, count));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}


ReportBolt.java

public class ReportBolt extends BaseRichBolt {
    private HashMap<String, Long> counts = null;

    public void prepare(Map config, TopologyContext context,
                        OutputCollector collector) {
        this.counts = new HashMap<String, Long>();
    }

    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Long count = tuple.getLongByField("count");
        this.counts.put(word, count);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) { /* no output */ }

    public void cleanup() {
        System.out.println("--- FINAL COUNTS ---");
        List<String> keys = new ArrayList<String>();
        keys.addAll(this.counts.keySet());
        Collections.sort(keys);
        for (String key : keys) {
            System.out.println(key + " : " + this.counts.get(key));
        }
    }
}



WordCountTopology.java

public class WordCountTopology {
    private static final String SENTENCE_SPOUT_ID = "sentence-spout";
    private static final String SPLIT_BOLT_ID = "split-bolt";
    private static final String COUNT_BOLT_ID = "count-bolt";
    private static final String REPORT_BOLT_ID = "report-bolt";
    private static final String TOPOLOGY_NAME = "word-count-topology";

    public static void main(String[] args) throws Exception {
        SentenceSpout spout = new SentenceSpout();
        SplitSentenceBolt splitBolt = new SplitSentenceBolt();
        WordCountBolt countBolt = new WordCountBolt();
        ReportBolt reportBolt = new ReportBolt();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout(SENTENCE_SPOUT_ID, spout);
        // SentenceSpout --> SplitSentenceBolt
        builder.setBolt(SPLIT_BOLT_ID, splitBolt)
               .shuffleGrouping(SENTENCE_SPOUT_ID);


WordCountTopology.java: Continued

        // SplitSentenceBolt --> WordCountBolt
        builder.setBolt(COUNT_BOLT_ID, countBolt)
               .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
        // WordCountBolt --> ReportBolt
        builder.setBolt(REPORT_BOLT_ID, reportBolt)
               .globalGrouping(COUNT_BOLT_ID);

        Config config = new Config();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology(TOPOLOGY_NAME, config, builder.createTopology());
        waitForSeconds(10);
        cluster.killTopology(TOPOLOGY_NAME);
        cluster.shutdown();
    }
}
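The fieldsGrouping on "word" matters for correctness: every occurrence of the same word must reach the same WordCountBolt task, or its count would be split across tasks. The underlying routing idea can be sketched as hashing the grouping field and taking it modulo the task count (a simplification; Storm's actual routing differs in detail):

```java
public class FieldsGroupingSketch {
    // Route a tuple to a task index based on the hash of its grouping field
    static int taskFor(String fieldValue, int numTasks) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        // The same word always maps to the same task,
        // so its running count lives in exactly one place
        System.out.println(taskFor("dog", numTasks) == taskFor("dog", numTasks)); // prints true
        System.out.println(taskFor("dog", numTasks));
        System.out.println(taskFor("fleas", numTasks));
    }
}
```

By contrast, shuffleGrouping (used between the spout and the split bolt, where no per-key state exists) distributes tuples randomly, and globalGrouping sends everything to a single ReportBolt task so one table holds all counts.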



Results
--- FINAL COUNTS ---
a : 1426
ate : 1426
beverages : 1426
cold : 1426
cow : 1426
dog : 2852
don't : 2851
fleas : 2851
has : 1426
have : 1426
homework : 1426
i : 4276
like : 2851
man : 1426
my : 2852
the : 1426
think : 1425
--------------


4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron

Apache Storm

Parallelism in Storm



Components of the Storm cluster
• Nodes (machines)
  • Execute portions of a topology
• Workers (JVMs)
  • Independent JVM processes running on a node
  • Each node is configured to run one or more workers
  • A topology may request one or more workers to be assigned to it
• Executors (threads)
  • Java threads running within a worker JVM process
  • Multiple tasks can be assigned to a single executor
  • Unless explicitly overridden, Storm assigns one task to each executor
• Tasks (bolt/spout instances)
  • Instances of spouts and bolts whose nextTuple() and execute() methods are called by executor threads


Parallelism in the WordCount topology
• In our example, we have NOT used any of Storm's parallelism settings
  • The default parallelism factor is one
• Topology execution flow:

[Figure: one worker JVM on a node with four executor threads, each running one task: Sentence Spout, Split Sentence Bolt, Word Count Bolt, Report Bolt]



Adding workers to a topology
• Through configuration
• Through APIs
  • Passing a Config object to the submitTopology() method
  • Bolts and spouts do not have to change

Config config = new Config();
config.setNumWorkers(2);


Adding executors and tasks
• Specify the number of executors when defining a stream grouping
  • builder.setSpout(SENTENCE_SPOUT_ID, spout, 2);
  • Assigns two tasks, and each task is assigned its own executor thread
  • 2 executors on a single worker



Two spout tasks (if we are using one worker)
• (Diagram) One worker (JVM) on a node now runs five executor threads: two for the Sentence Spout tasks (one task per executor), and one each for the Split Sentence Bolt, Word Count Bolt, and Report Bolt


builder.setBolt(SPLIT_BOLT_ID, splitBolt, 2)
       .setNumTasks(4)
       .shuffleGrouping(SENTENCE_SPOUT_ID);

In SplitSentenceBolt and WordCountBolt:
• Set up the split sentence bolt to execute as 4 tasks and 2 executors (the parallelism hint)
• Storm will run 2 tasks per executor (thread): each executor thread is assigned two tasks to execute
• 2 executors to a single worker; total tasks are 4
• How many workers will work for this example?


• Answer: 2 workers (JVMs), with 2 executors per worker
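The arithmetic behind this answer (4 tasks over 2 executors over 2 workers) can be sketched in plain Java. This is only an illustration of the round-robin idea; Storm's actual scheduler is pluggable and more involved, and the class and method names here are made up:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParallelismSketch {
    // Round-robin "items" (tasks or executors) into "bins" (executors or workers).
    static Map<Integer, List<Integer>> roundRobin(int items, int bins) {
        Map<Integer, List<Integer>> assignment = new HashMap<>();
        for (int b = 0; b < bins; b++) assignment.put(b, new ArrayList<>());
        for (int i = 0; i < items; i++) assignment.get(i % bins).add(i);
        return assignment;
    }

    public static void main(String[] args) {
        // setNumWorkers(2), parallelism hint 2, setNumTasks(4)
        Map<Integer, List<Integer>> tasksPerExecutor = roundRobin(4, 2);
        Map<Integer, List<Integer>> executorsPerWorker = roundRobin(2, 2);
        System.out.println("tasks per executor:   " + tasksPerExecutor);
        System.out.println("executors per worker: " + executorsPerWorker);
    }
}
```

With these numbers, each executor ends up with 2 tasks and each worker with 1 executor of this bolt (plus executors of the other components).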


4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron

Apache Storm

Stream Groupings


Stream groupings
• How a stream’s tuples are distributed among bolt tasks in a topology
• E.g., the SplitSentenceBolt class was assigned four tasks in the topology
  • Which tuples will be processed by which task?
  • The stream grouping determines which one of those tasks will receive a given tuple


Seven built-in stream groupings (1/3)
• Shuffle grouping
  • Randomly distributes tuples across the target bolt’s tasks
• Fields grouping
  • Routes tuples to bolt tasks based on the values of the fields specified in the grouping
  • E.g., grouped on the “word” field: tuples with the same value for the “word” field will always be routed to the same bolt task
• All grouping
  • Replicates the tuple stream across all bolt tasks
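The fields-grouping invariant (equal field values always reach the same task) is typically achieved by hashing the grouping fields modulo the task count. The sketch below illustrates that idea in plain Java; Storm's actual hash function may differ, and the class and method names are assumptions for illustration:

```java
import java.util.List;
import java.util.Objects;

public class FieldsGroupingSketch {
    // Hash the grouping-field values and take the result modulo the number
    // of target tasks. Equal field values always map to the same task index.
    static int chooseTask(List<Object> fieldValues, int numTasks) {
        return Math.floorMod(Objects.hash(fieldValues.toArray()), numTasks);
    }

    public static void main(String[] args) {
        int t1 = chooseTask(List.of("dog"), 4);
        int t2 = chooseTask(List.of("dog"), 4);
        System.out.println("'dog' routes to the same task both times: " + (t1 == t2));
    }
}
```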


Seven built-in stream groupings (2/3)
• Global grouping
  • Routes all tuples in a stream to a single task
  • Chooses the task with the lowest task ID value
• None grouping
  • Functionally equivalent to the shuffle grouping
  • Reserved for future use
• Direct grouping
  • The source stream decides which component will receive a given tuple, by calling the emitDirect() method
  • Only for streams that have been declared as direct streams


Seven built-in stream groupings (3/3)
• Local or shuffle grouping
  • Shuffles tuples among bolt tasks running in the same worker process, if any
  • Otherwise, performs a shuffle grouping
  • Depending on the parallelism of a topology, the local or shuffle grouping can increase topology performance by limiting network transfer
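The "local if possible, otherwise shuffle" decision can be sketched in a few lines of plain Java (an illustration, not Storm's implementation; the names are made up):

```java
import java.util.List;
import java.util.Random;

public class LocalOrShuffleSketch {
    // If the target bolt has tasks in the sender's own worker process,
    // shuffle among those (no network transfer); otherwise fall back to
    // a plain shuffle across all target tasks.
    static int chooseTask(List<Integer> localTasks, List<Integer> allTasks, Random rnd) {
        List<Integer> pool = localTasks.isEmpty() ? allTasks : localTasks;
        return pool.get(rnd.nextInt(pool.size()));
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        // Tasks 2 and 3 are co-located with the sender; tasks 0..3 exist overall.
        int chosen = chooseTask(List.of(2, 3), List.of(0, 1, 2, 3), rnd);
        System.out.println("chose co-located task " + chosen);
    }
}
```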


Custom stream grouping (the CustomStreamGrouping interface)

public interface CustomStreamGrouping extends Serializable {
    void prepare(WorkerTopologyContext context,
                 GlobalStreamId stream,
                 List<Integer> targetTasks);

    List<Integer> chooseTasks(int taskId, List<Object> values);
}
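The core of a custom grouping is chooseTasks(): given a tuple's values and the candidate target tasks (normally captured in prepare()), return the task(s) that should receive the tuple. The sketch below mimics that logic in plain Java, omitting Storm's WorkerTopologyContext and GlobalStreamId types so it compiles standalone; the class name and routing rule are illustrative assumptions:

```java
import java.util.List;

public class CustomGroupingSketch {
    // Stand-in for CustomStreamGrouping.chooseTasks(): route each tuple to
    // one target task, chosen deterministically from its first value.
    static List<Integer> chooseTasks(List<Integer> targetTasks, List<Object> values) {
        int idx = Math.floorMod(values.get(0).hashCode(), targetTasks.size());
        return List.of(targetTasks.get(idx));
    }

    public static void main(String[] args) {
        List<Integer> targets = List.of(7, 8, 9); // task IDs assigned by Storm
        System.out.println(chooseTasks(targets, List.of("my dog has fleas")));
    }
}
```

Any deterministic rule works here; returning more than one task ID sends the tuple to each of them.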


Example of grouping (1/2)
• nextTuple() method of SentenceSpout

public void nextTuple() {
    if (index < sentences.length) {
        this.collector.emit(new Values(sentences[index]));
        index++;
    }
    Utils.waitForMillis(1);
}

• Output with the original (fields-grouped) topology:

--- FINAL COUNTS ---
a : 2    ate : 2    beverages : 2    cold : 2    cow : 2
dog : 4    don't : 4    fleas : 4    has : 2    have : 2
homework : 2    i : 6    like : 4    man : 2    my : 4
the : 2    think : 2
--------------


Example of grouping (2/2)
• Now change the grouping on the count bolt to a shuffle grouping and rerun the topology:

builder.setBolt(COUNT_BOLT_ID, countBolt, 4).shuffleGrouping(SPLIT_BOLT_ID);

• New output (shuffle grouping):

--- FINAL COUNTS ---
a : 1    ate : 2    beverages : 1    cold : 1    cow : 1
dog : 2    don't : 2    fleas : 1    has : 1    have : 1
homework : 1    i : 3    like : 1    man : 1    my : 1
the : 1    think : 1
--------------

WHY?

• Original output (fields grouping):

--- FINAL COUNTS ---
a : 2    ate : 2    beverages : 2    cold : 2    cow : 2
dog : 4    don't : 4    fleas : 4    has : 2    have : 2
homework : 2    i : 6    like : 4    man : 2    my : 4
the : 2    think : 2
--------------


• The original topology definition (fields grouping on the count bolt):

builder.setBolt(SPLIT_BOLT_ID, splitBolt)
       .shuffleGrouping(SENTENCE_SPOUT_ID);
builder.setBolt(COUNT_BOLT_ID, countBolt)
       .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
builder.setBolt(REPORT_BOLT_ID, reportBolt)
       .globalGrouping(COUNT_BOLT_ID);

Why?
• The count bolt is stateful: it maintains a count for each word it has seen
• With a shuffle grouping, occurrences of the same word are spread across tasks, so no single task captures the total count of appearances of a word
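The effect can be reproduced without Storm by simulating four stateful counter tasks under each routing rule. This is a sketch: round-robin stands in for shuffle's random choice, the class names are made up, and the input stream is a toy word list:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupingCountSim {
    // Simulate 4 stateful count tasks, each keeping its own word->count map.
    static List<Map<String, Integer>> run(List<String> words, boolean fieldsGrouping) {
        int numTasks = 4;
        List<Map<String, Integer>> tasks = new java.util.ArrayList<>();
        for (int i = 0; i < numTasks; i++) tasks.add(new HashMap<>());
        int rr = 0; // round-robin stands in for shuffle's random routing
        for (String w : words) {
            int task = fieldsGrouping ? Math.floorMod(w.hashCode(), numTasks)
                                      : (rr++ % numTasks);
            tasks.get(task).merge(w, 1, Integer::sum);
        }
        return tasks;
    }

    // Largest count any single task holds for the given word.
    static int maxCount(List<Map<String, Integer>> tasks, String w) {
        return tasks.stream().mapToInt(t -> t.getOrDefault(w, 0)).max().orElse(0);
    }

    public static void main(String[] args) {
        List<String> words = List.of("dog", "dog", "dog", "dog");
        // Fields grouping: one task sees every "dog", so it holds the true total (4).
        System.out.println("fields:  " + maxCount(run(words, true), "dog"));
        // Shuffle: the four "dog" tuples are spread across tasks, so no task holds the total.
        System.out.println("shuffle: " + maxCount(run(words, false), "dog"));
    }
}
```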


Questions?
