Spark Summit EU talk by Ted Malaska


SPARK SUMMIT EUROPE 2016

Mastering Spark Unit Testing

Theodore Malaska, Blizzard Entertainment, Group Technical Architect

About Me
• Ted Malaska
  – Architect at Blizzard Entertainment, working on Battle.net
  – PSA at Cloudera: worked with over 100 companies, committed to over a dozen open-source projects, co-wrote a book
  – Architect at FINRA: worked on OATS

About Blizzard
• Maker of great games
  – World of Warcraft
  – StarCraft
  – Hearthstone
  – Overwatch
  – Heroes of the Storm
  – Diablo
• Tech giant
  – Massive clusters worldwide
  – More than 100 billion records of data a day
  – Driven to improve the gamer experience through technology

Overview
• Why to run Spark outside of a cluster
• What to test
• Running Local
• Running as a Unit Test
• Data Structures
• Supplying data to Unit Tests

Why to Run Spark Outside of a Cluster?

• Time
• Trusted Deployment
• Logic
• Money

What to test
• Experiments
• Complex logic
• Samples
• Business-generated scenarios

What about Notebooks
• UIs like Zeppelin
• Good for experimentation
• Quick feedback
• Not for productization
• Lack IDE features

Running Local
• A test doesn't always need to be a unit test
• Running local in your IDE is priceless

Example

//Get spark context
val sc: SparkContext = if (runLocal) {
  val sparkConfig = new SparkConf()
  sparkConfig.set("spark.broadcast.compress", "false")
  sparkConfig.set("spark.shuffle.compress", "false")
  sparkConfig.set("spark.shuffle.spill.compress", "false")
  new SparkContext("local[2]", "DataGenerator", sparkConfig)
} else {
  val sparkConf = new SparkConf().setAppName("DataGenerator")
  new SparkContext(sparkConf)
}

val filterWordSetBc = sc.broadcast(
  scala.io.Source.fromFile(filterWordFile).getLines.toSet)
val inputRDD = sc.textFile(inputFolder)
val filteredWordCount: RDD[(String, Int)] =
  filterAndCount(filterWordSetBc, inputRDD)
filteredWordCount.collect().foreach { case (name: String, count: Int) =>
  println(" - " + name + ":" + count)
}

Live Demo

Things to Note
• Needed a SparkContext
• Parallelize function
• Separate out testable work from driver code (see the filterAndCount sketch below)
• Everything is fully debuggable
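Neither the local example above nor the unit test further on actually shows filterAndCount; the following is a hedged sketch of what such a separated, testable function could look like. It is an assumption rather than slide content, but it is consistent with the counts asserted in the unit test later in the deck.

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object FilterWordCount {
  // Sketch: split lines into words, drop words in the broadcast filter set,
  // and count occurrences of the remaining words.
  def filterAndCount(filterWordSetBc: Broadcast[Set[String]],
                     inputRDD: RDD[String]): RDD[(String, Int)] = {
    inputRDD
      .flatMap(line => line.split(" "))
      .filter(word => !filterWordSetBc.value.contains(word))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
  }
}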

Now on to Unit Testing
• Just like running local, but easier

Example Code

class FilterWordCountSuite extends FunSuite with SharedSparkContext {
  test("testFilterWordCount") {
    val filterWordSetBc = sc.broadcast(Set[String]("the", "a", "is"))
    val inputRDD = sc.parallelize(Array(
      "the cat and the hat", "the car is blue",
      "the cat is in a car", "the cat lost his hat"))

    val filteredWordCount: RDD[(String, Int)] =
      FilterWordCount.filterAndCount(filterWordSetBc, inputRDD)

    assert(filteredWordCount.count() == 8)
    val map = filteredWordCount.collect().toMap
    assert(map.get("cat").getOrElse(0) == 3)
    assert(map.get("the").getOrElse(0) == 0)
  }
}

Shared Spark Context

trait SharedSparkContext extends BeforeAndAfterAll { self: Suite =>

  @transient private var _sc: SparkContext = _

  def sc: SparkContext = _sc

  var conf = new SparkConf(false)

  override def beforeAll() {
    _sc = new SparkContext("local[4]", "test", conf)
    super.beforeAll()
  }

  override def afterAll() {
    LocalSparkContext.stop(_sc)
    _sc = null
    super.afterAll()
  }
}

Maven Dependency

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>

<dependency>
  <groupId>org.scalatest</groupId>
  <artifactId>scalatest_${scala.binary.version}</artifactId>
  <version>2.2.1</version>
  <scope>test</scope>
</dependency>

Live Demo

Data Structures
(Diagram: local Collection → RDD)

Creating an RDD

val inputRDD = sc.parallelize(Array(
  "the cat and the hat", "the car is blue",
  "the cat is in a car", "the cat lost his hat"))

Methods for Selecting Data
• Developer-defined data
• Selected sampling of data from a production system
• Generated data from a data generator (see the sketch below)
  – sscheck
  – spark-testing-base
• Requirement/business-driven selection and generation
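The libraries above offer property-based data generation; as a point of contrast, a minimal hand-rolled generator sketch (hypothetical names, not from the slides) might look like this:

import scala.util.Random

// Hypothetical sketch: deterministic synthetic (user, time, amount) rows for tests.
def generateRows(n: Int, users: Seq[String], seed: Long = 42L): Seq[(String, String, String)] = {
  val rnd = new Random(seed)
  (1 to n).map { i =>
    (users(rnd.nextInt(users.length)), i.toString, (rnd.nextInt(100) + 1).toString)
  }
}

val generatedRdd = sc.parallelize(generateRows(100, Seq("bob", "cat")))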

What about DataFrames and SQL
(Diagram: a DataFrame is an RDD of Rows plus a Schema)

How to create a DataFrame
• Code up a StructType by hand
• Take the schema from an existing table

Creating a StructType

hc.sql("create external table trans (user string, time string, amount string) " +
  "location '" + outputTableLocation + "'")

// Split each CSV line into one Row field per column
val rowRdd = sc.textFile(inputCsvFolder).map(r => Row.fromSeq(r.split(",")))

val userField = new StructField("user", StringType, nullable = true)
val timeField = new StructField("time", StringType, nullable = true)
val amountField = new StructField("amount", StringType, nullable = true)

val schema = StructType(Array(userField, timeField, amountField))

hc.createDataFrame(rowRdd, schema).registerTempTable("temp")

hc.sql("insert into trans select * from temp")

Taking a Schema from a Table

hc.sql("create external table trans (user string, time string, amount string) " +
  "location '" + outputTableLocation + "'")

// Split each CSV line into one Row field per column
val rowRdd = sc.textFile(inputCsvFolder).map(r => Row.fromSeq(r.split(",")))

val emptyDf = hc.sql("select * from trans limit 0")

hc.createDataFrame(rowRdd, emptyDf.schema).registerTempTable("temp")

hc.sql("insert into trans select * from temp")

What about Nested DataFrames

hc.sql("create external table trans (" +
  " user string," +
  " time string," +
  " items array<struct<" +
  " amount: string " +
  ">>) " +
  "location '" + outputTableLocation + "'")

val rowRdd = sc.parallelize(Array(
  Row("bob", 10, Seq(
    Row(1, 2, 3, 4)))
))
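The slide stops at the raw RDD of nested Rows; below is a hedged sketch of a matching nested StructType and the DataFrame creation step. It is not shown in the deck, and the row data is adjusted so that it actually fits the user/time/items schema declared above.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Assumed nested schema for: user string, time string, items array<struct<amount:string>>
val itemType = StructType(Array(StructField("amount", StringType, nullable = true)))
val nestedSchema = StructType(Array(
  StructField("user", StringType, nullable = true),
  StructField("time", StringType, nullable = true),
  StructField("items", ArrayType(itemType), nullable = true)))

val nestedRdd = sc.parallelize(Array(
  Row("bob", "10", Seq(Row("1"), Row("2")))))

hc.createDataFrame(nestedRdd, nestedSchema).registerTempTable("nested_temp")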

Important NOTE
• Spark should be used to bring the big to the small
• In that case, more of the logic may live in the small (see the sketch below)
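One way to read that advice, as a hedged illustration with hypothetical names (not from the slides): let Spark shrink the data, then keep the remaining logic in plain Scala, where it needs no SparkContext to test.

// Hypothetical "small" logic: pure Scala, trivially unit-testable without Spark.
def topSpender(userTotals: Map[String, Long]): Option[String] =
  if (userTotals.isEmpty) None else Some(userTotals.maxBy(_._2)._1)

// In the job, Spark brings the big to the small ...
// val userTotals = transRdd.reduceByKey(_ + _).collect().toMap
// ... and the small part is tested as an ordinary function:
assert(topSpender(Map("bob" -> 5L, "cat" -> 2L)) == Some("bob"))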

Dataframe Testing Example

test("RunCountingSQL") {
  val rdd = sqlContext.sparkContext.parallelize(Array(
    Row("bob", "1", "1"), Row("bob", "1", "1"),
    Row("bob", "1", "1"), Row("cat", "1", "1"),
    Row("bob", "1", "1"), Row("cat", "1", "1"), Row("bob", "1", "1")
  ))

  val userField = new StructField("user", StringType, nullable = true)
  val timeField = new StructField("time", StringType, nullable = true)
  val amountField = new StructField("amount", StringType, nullable = true)
  val schema = StructType(Array(userField, timeField, amountField))
  sqlContext.createDataFrame(rdd, schema).registerTempTable("trans")

  val results = RunCountingSQL.runCountingSQL(sqlContext)

  assert(results.length == 2)
}
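RunCountingSQL.runCountingSQL itself is not shown in the deck; one plausible sketch that is consistent with the assert above (the sample rows contain two distinct users) could be:

import org.apache.spark.sql.{Row, SQLContext}

object RunCountingSQL {
  // Hypothetical sketch: one result row per distinct user in the registered "trans" table.
  def runCountingSQL(sqlContext: SQLContext): Array[Row] =
    sqlContext.sql("select user, count(*) as cnt from trans group by user").collect()
}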

What about Hive
• If you are testing Hive:
  – Spark will create a local HMS (Hive metastore); see the sketch below
  – Allows all the usual creation, deletion, and writing
  – Make sure you can write to the table location
  – Also feel free to delete the metastore folder between jobs
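A hedged sketch of what that looks like in a Spark 1.x-era test (the warehouse path is an assumption; the local metastore_db folder is created under the working directory by default):

import java.io.File
import org.apache.spark.sql.hive.HiveContext

// Delete local metastore and warehouse folders between runs so tests stay independent.
def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) f.listFiles.foreach(deleteRecursively)
  f.delete()
}
deleteRecursively(new File("metastore_db"))
deleteRecursively(new File("target/test-warehouse"))

// HiveContext stands up a local metastore, so tests can create, write, and drop tables.
val hc = new HiveContext(sc)
hc.setConf("hive.metastore.warehouse.dir", new File("target/test-warehouse").getAbsolutePath)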

Now for Streaming
• Let's go one step further
• What is a DStream?
• How do you test a DStream?

Now for Streaming
(Diagram: a DStream is a sequence of RDDs over time)

TestOperation Example

class StreamFilterWordCount extends TestSuiteBase {
  test("mapPartitions") {
    assert(numInputPartitions === 2, "Number of input partitions has been changed from 2")

    val input = Seq(
      Seq("the cat and the hat", "the car is blue"),
      Seq("the cat is in a car", "the cat lost his hat"))

    val output = Seq(Seq(9), Seq(11))

    val operation = (r: DStream[String]) => StreamingWordCount.wordCount(r)

    testOperation(input, operation, output)
  }
}
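StreamingWordCount.wordCount is not shown either; a hedged sketch that matches the expected output above (9 words in the first batch, 11 in the second) would be:

import org.apache.spark.streaming.dstream.DStream

object StreamingWordCount {
  // Hypothetical sketch: emit the total number of words seen in each batch.
  def wordCount(lines: DStream[String]): DStream[Int] =
    lines.flatMap(_.split(" ")).map(_ => 1).reduce(_ + _)
}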

What about other things
• MLlib
• GraphX

What about Integration Testing
• MiniClusters (see the sketch below)
  – HBase
  – Kafka
  – HDFS
  – Cassandra
  – Many more
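As one hedged example, an in-JVM HDFS minicluster can back an integration test; this sketch assumes the hadoop-hdfs test-jar is on the test classpath and uses hypothetical paths:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hdfs.MiniDFSCluster

// Spin up a single-datanode HDFS cluster inside the test JVM.
val conf = new Configuration()
val cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build()
val fs = cluster.getFileSystem
try {
  fs.mkdirs(new Path("/test-input"))
  // ... point the Spark job at the minicluster's URI + "/test-input" ...
} finally {
  cluster.shutdown()
}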

Testing True Distributed
• Running on a dev environment
• Using Docker to spin up an environment
• Mini YARN or Mesos

SPARK SUMMIT EUROPE 2016

THANK YOU.
