Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016

Making sense of RDDs, DataFrames, SparkSQL and Datasets APIs

Motivation

An overview of the different Spark APIs for working with structured data.

Timeline of Spark APIs

Spark 1.0 used the RDD API - a Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Spark 1.3 introduced the DataFrames API - a distributed collection of data organized into named columns. Also a well-known concept from R / Python Pandas.

Spark 1.6 introduced an experimental Datasets API - an extension of the DataFrame API that provides a type-safe, object-oriented programming interface.

RDD

RDD - Resilient Distributed Dataset

Functional transformations on partitioned collections of opaque objects.

Define a case class representing the schema of our data. Each field represents a column of the DB.

> case class Person(name: String, age: Int)

defined class Person

Create a parallelized collection (RDD)

> val peopleRDD = sc.parallelize(Array(Person("Lars", 37), Person("Sven", 38), Person("Florian", 39), Person("Dieter", 37)))

peopleRDD: org.apache.spark.rdd.RDD[Person] = ParallelCollectionRDD[5885] at parallelize at <console>:95

RDD of type Person

> val rdd = peopleRDD.filter(_.age > 37)

rdd: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[5886] at filter at <console>:98

NB: Returns Person objects

> rdd.collect.foreach(println(_))

Person(Sven,38)
Person(Florian,39)
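The functional style of the RDD API carries over to other transformations as well. The following is a minimal sketch, assuming Spark 1.6 on the classpath and a local master; the object name `RddAverageAge` and the helper `averageAge` are hypothetical and not part of the original notebook.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Same record type as in the notebook cells above.
case class Person(name: String, age: Int)

object RddAverageAge {
  // Average age computed with plain functional transformations.
  def averageAge(sc: SparkContext): Double = {
    val peopleRDD = sc.parallelize(Seq(
      Person("Lars", 37), Person("Sven", 38), Person("Florian", 39), Person("Dieter", 37)))

    // map and reduce receive ordinary Scala closures. Spark runs them as-is
    // and cannot inspect or optimize what happens inside them.
    val ageSum = peopleRDD.map(_.age).reduce(_ + _)
    ageSum.toDouble / peopleRDD.count()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-average-age").setMaster("local[*]"))
    try println(s"Average age: ${averageAge(sc)}")
    finally sc.stop()
  }
}
```

Because the closures are opaque to Spark, this computation is executed exactly as written, which is the "lack of built-in optimization" that distinguishes RDDs from the other two APIs.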

DataFrames

Declarative transformations on partitioned collection of tuples.

> val peopleDF = peopleRDD.toDF

peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: int]

> peopleDF.show()

+-------+---+
|   name|age|
+-------+---+
|   Lars| 37|
|   Sven| 38|
|Florian| 39|
| Dieter| 37|
+-------+---+

> peopleDF.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)

Show only age column

> peopleDF.select("age").show()

+---+
|age|
+---+
| 37|
| 38|
| 39|
| 37|
+---+

NB: Result set consists of Rows (arrays of Strings and Ints)

> peopleDF.filter("age > 37").collect.foreach(row => println(row))

[Sven,38]
[Florian,39]

DataSets

Create Datasets from RDDs

Implicit conversion is also available

> val peopleDS = peopleRDD.toDS

peopleDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

NB: Result set consists of Person objects

> val ds = peopleDS.filter(_.age > 37)

ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

> ds.collect.foreach(println(_))

Person(Sven,38)
Person(Florian,39)

> ds.queryExecution.analyzed

res294: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
MapPartitions <function1>, class[name[0]: string, age[0]: int], class[name[0]: string, age[0]: int], [name#11044,age#11045]
+- LogicalRDD [name#11041,age#11042], MapPartitionsRDD[5894] at rddToDatasetHolder at <console>:97

> ds.queryExecution.optimizedPlan

res295: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
MapPartitions <function1>, class[name[0]: string, age[0]: int], class[name[0]: string, age[0]: int], [name#11044,age#11045]
+- LogicalRDD [name#11041,age#11042], MapPartitionsRDD[5894] at rddToDatasetHolder at <console>:97

> ds.collect.foreach(println(_))

Person(Sven,38)
Person(Florian,39)

Spark SQL

> // Get SQL context from Spark context
  // NB: "In Databricks, developers should utilize the shared HiveContext instead of
  // creating one using the constructor. In Scala and Python notebooks, the shared
  // context can be accessed as sqlContext. When running a job, you can access the
  // shared context by calling SQLContext.getOrCreate(SparkContext.getOrCreate())."
  import org.apache.spark.sql.SQLContext
  val sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())

import org.apache.spark.sql.SQLContext
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@50ec5129

Register DataFrame for usage via SQL

> peopleDF.registerTempTable("sparkPeopleTbl")

The results of SQL queries are DataFrames and support all the usual RDD operations.

> sqlContext.sql("SELECT * FROM sparkPeopleTbl WHERE age > 37")

res298: org.apache.spark.sql.DataFrame = [name: string, age: int]

Print execution plan

> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").queryExecution.analyzed

res299: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [name#11037,age#11038]
+- Filter (age#11038 > 37)
   +- Subquery sparkpeopletbl
      +- LogicalRDD [name#11037,age#11038], MapPartitionsRDD[5887] at rddToDataFrameHolder at <console>:97

Execution plan optimized by the built-in optimizer

> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").queryExecution.optimizedPlan

res300: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [name#11037,age#11038]
+- Filter (age#11038 > 37)
   +- LogicalRDD [name#11037,age#11038], MapPartitionsRDD[5887] at rddToDataFrameHolder at <console>:97

> %sql SELECT * FROM sparkPeopleTbl WHERE age > 37

name     age
Sven     38
Florian  39

> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").collect

res301: Array[org.apache.spark.sql.Row] = Array([Sven,38], [Florian,39])

> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").map(row => "NAME: " + row(0)).collect

res302: Array[String] = Array(NAME: Sven, NAME: Florian)

> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").rdd.map(row => "NAME: " + row(0)).collect()

res303: Array[String] = Array(NAME: Sven, NAME: Florian)

> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37")

res304: org.apache.spark.sql.DataFrame = [name: string, age: int]

Running SQL queries against Parquet files directly

> ls3("s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/")

s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_SUCCESS [0.00 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_common_metadata [0.00 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_metadata [0.03 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00000-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.92 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00001-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.87 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00002-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.84 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00003-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.92 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00004-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00005-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.87 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00006-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00007-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00008-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00009-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.95 MiB]

> %sql SELECT * FROM parquet.`s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/` WHERE trip_time_in_secs <= 60

medallion                         hack_license                      vendor_id  rate_code  store_and_fwd_flag  pickup_datetime
0E0157EB3E6F927FE7FA86C3A0762B3B  4E0569072023608FFEB72F454CCF408B  VTS        1                              2013-01-13T11:08:00.000+0000
19BF1BB516C4E992EA3FBAEDA73D6262  E4CAC9101BFE631554B57906364761D3  VTS        2                              2013-01-13T10:33:00.000+0000
D57C7392455C38D9404660F7BC63D1B6  EC5837D805127379D72FF6C35279890B  VTS        1                              2013-01-13T04:58:00.000+0000
67108EDF8123623806A1DAFE8811EE63  36C9437BD2FF31940BEBED44DDDDDB8A  VTS        5                              2013-01-13T04:58:00.000+0000
5F78CC6D4ECD0541B765FECE17075B6F  39703F5449DADC0AEFFAFEFB1A6A7E64  VTS        1                              2013-01-13T08:56:00.000+0000

Showing the first 1000 rows.
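Querying Parquet files directly with SQL and loading them through the DataFrame reader should give the same rows. The sketch below illustrates this on a small local file instead of the S3 bucket used above. It assumes Spark 1.6 with a local master; the object name `ParquetTwoWays`, the helper `countOlderThan37` and the temporary path are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object ParquetTwoWays {
  // Returns (rows found via the DataFrame reader, rows found via SQL on the files).
  def countOlderThan37(sc: SparkContext): (Long, Long) = {
    val sqlContext = SQLContext.getOrCreate(sc)
    import sqlContext.implicits._

    // Hypothetical scratch location. The writer expects a path that does not exist yet,
    // so a fresh sub-directory of a temporary directory is used.
    val path = Files.createTempDirectory("people").resolve("parquet").toString
    sc.parallelize(Seq(
      Person("Lars", 37), Person("Sven", 38), Person("Florian", 39), Person("Dieter", 37)))
      .toDF()
      .write.parquet(path)

    // Variant 1: load the files as a DataFrame, then filter.
    val viaReader = sqlContext.read.parquet(path).filter("age > 37").count()

    // Variant 2: address the files directly in the FROM clause.
    val viaSql = sqlContext.sql(s"SELECT * FROM parquet.`$path` WHERE age > 37").count()

    (viaReader, viaSql)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val sc = new SparkContext(
      new SparkConf().setAppName("parquet-two-ways").setMaster("local[*]"))
    try println(countOlderThan37(sc))
    finally sc.stop()
  }
}
```

Both variants end up as a DataFrame over the same files, so the choice between them is mostly a matter of taste.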

Conclusions

RDDs

RDDs remain the core component for the native distributed collections. But due to the lack of built-in optimization, DFs and DSs should be preferred.

DataFrames

DataFrames and Spark SQL are very flexible and bring built-in optimization also to dynamic languages like Python and R. Besides this, they allow combining the declarative and the functional way of working with structured data. They are regarded as the most stable and flexible API.
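To illustrate how the declarative and the functional style can be combined in one pipeline, here is a minimal sketch assuming Spark 1.6 and a local master. The object name `MixedStyles` and the helper `namesOlderThan37` are hypothetical; the table name is the one used in the notebook.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object MixedStyles {
  // Names of all people older than 37, sorted alphabetically.
  def namesOlderThan37(sc: SparkContext): Array[String] = {
    val sqlContext = SQLContext.getOrCreate(sc)
    import sqlContext.implicits._

    val peopleDF = sc.parallelize(Seq(
      Person("Lars", 37), Person("Sven", 38), Person("Florian", 39), Person("Dieter", 37)))
      .toDF()
    peopleDF.registerTempTable("sparkPeopleTbl")

    sqlContext.sql("SELECT name, age FROM sparkPeopleTbl") // declarative: SQL
      .filter($"age" > 37)                                 // declarative: DataFrame DSL
      .rdd                                                 // switch to RDD[Row]
      .map(row => row.getString(0))                        // functional: plain closure
      .collect()
      .sorted
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val sc = new SparkContext(
      new SparkConf().setAppName("mixed-styles").setMaster("local[*]"))
    try println(namesOlderThan37(sc).mkString(", "))
    finally sc.stop()
  }
}
```

The SQL and DSL steps are planned by the built-in optimizer, while everything after `.rdd` is executed as written.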

Datasets

Datasets unify the best of both worlds: type safety from RDDs and the built-in optimization available for DataFrames. But the API is still in an experimental phase and can currently be used only from Scala and Java. DSs allow even further optimizations (memory compaction + faster serialization using encoders). Datasets are not yet mature, but they are expected to quickly become a common way to work with structured data, and Spark plans to converge the APIs even further.
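The difference in type safety can be sketched as follows, assuming Spark 1.6 and a local master. The object name `TypeSafety`, the helper `compare` and the deliberately misspelled column `agee` are hypothetical and only serve the illustration.

```scala
import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object TypeSafety {
  // Returns (did the misspelled DataFrame query fail?, names found via the Dataset).
  def compare(sc: SparkContext): (Boolean, Array[String]) = {
    val sqlContext = SQLContext.getOrCreate(sc)
    import sqlContext.implicits._

    val people = sc.parallelize(Seq(
      Person("Lars", 37), Person("Sven", 38), Person("Florian", 39), Person("Dieter", 37)))
    val peopleDF = people.toDF()
    val peopleDS = people.toDS()

    // DataFrame: columns are referenced by string, so the typo "agee" compiles
    // and is only rejected when Spark analyzes the query at run time.
    val dfAttempt = Try(peopleDF.filter("agee > 37").collect())

    // Dataset: the lambda is checked by the Scala compiler. The following line
    // would not compile, because agee is not a member of Person:
    // peopleDS.filter(_.agee > 37)
    val older = peopleDS.filter(_.age > 37).collect().map(_.name).sorted

    (dfAttempt.isFailure, older)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val sc = new SparkContext(
      new SparkConf().setAppName("type-safety").setMaster("local[*]"))
    try {
      val (dfFailed, names) = compare(sc)
      println(s"DataFrame query with typo failed at run time: $dfFailed")
      println(s"Dataset result: ${names.mkString(", ")}")
    } finally sc.stop()
  }
}
```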

Here are some benchmarks of Datasets:

Transformation of Data Types in Spark
