comsysto-reply-gmbh
Making sense of RDDs, DataFrames, SparkSQL and Datasets APIs
Motivation
An overview of the different Spark APIs for working with structured data.
Timeline of Spark APIs
Spark 1.0 used the RDD API - a Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.
Spark 1.3 introduced the DataFrames API - a distributed collection of data organized into named columns. This is also a well-known concept from R / Python Pandas.
Spark 1.6 introduced an experimental Datasets API - an extension of the DataFrame API that provides a type-safe, object-oriented programming interface.
RDD
RDD - Resilient Distributed Dataset
Functional transformations on partitioned collections of opaque objects.
Define a case class representing the schema of our data. Each field represents a column of the DB.
> case class Person(name: String, age: Int)
defined class Person
Create parallelized collection (RDD)
> val peopleRDD = sc.parallelize(Array(Person("Lars", 37), Person("Sven", 38), Person("Florian", 39), Person("Dieter", 37)))
peopleRDD: org.apache.spark.rdd.RDD[Person] = ParallelCollectionRDD[5885] at parallelize at <console>:95
RDD of type Person
> val rdd = peopleRDD.filter(_.age > 37)
rdd: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[5886] at filter at <console>:98
NB: Returns Person objects
> rdd.collect.foreach(println(_))
Person(Sven,38)
Person(Florian,39)
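The same functional style carries over to other RDD transformations. The following is a minimal sketch rather than part of the original notebook: it assumes a spark-shell or notebook session (where `SparkContext.getOrCreate` returns the already running context) or a local test setup, and the name `countsByAge` and the application name are purely illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class Person(name: String, age: Int)

// Reuses the running context if there is one; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("rdd-sketch"))

val peopleRDD = sc.parallelize(Seq(
  Person("Lars", 37), Person("Sven", 38), Person("Florian", 39), Person("Dieter", 37)))

// Count people per age using plain Scala functions on opaque Person objects.
// Spark cannot look inside these lambdas, so no query optimization takes place.
val countsByAge = peopleRDD
  .map(p => (p.age, 1))
  .reduceByKey(_ + _)
  .collect()
  .sortBy(_._1)

countsByAge.foreach(println)
```

Because the lambdas are opaque to Spark, the work is executed exactly as written, which is the trade-off the later sections contrast with DataFrames and Datasets.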
DataFrames
Declarative transformations on partitioned collection of tuples.
> val peopleDF = peopleRDD.toDF
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: int]
> peopleDF.show()
+-------+---+
|   name|age|
+-------+---+
|   Lars| 37|
|   Sven| 38|
|Florian| 39|
| Dieter| 37|
+-------+---+
> peopleDF.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
Show only age column
> peopleDF.select("age").show()
+---+
|age|
+---+
| 37|
| 38|
| 39|
| 37|
+---+
NB: The result set consists of generic Row objects (arrays of Strings and Ints), not Person objects
> peopleDF.filter("age > 37").collect.foreach(row => println(row))
[Sven,38]
[Florian,39]
Datasets
Create Datasets from RDDs
Implicit conversion is also available
> val peopleDS = peopleRDD.toDS
peopleDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
NB: The result set consists of Person objects
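To make the implicit conversions more concrete, here is a small sketch of moving between the three representations. It is not taken from the original notebook: it assumes a spark-shell or notebook session with the Spark 1.6-style `SQLContext` used throughout this document, and the names `backToDF`, `typedNames` and `untypedNames` are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

// Reuses the running contexts if present; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("conversion-sketch"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._ // brings toDF, toDS and the encoders into scope

val peopleRDD = sc.parallelize(Seq(Person("Lars", 37), Person("Sven", 38)))

val peopleDF = peopleRDD.toDF()    // RDD[Person]     -> DataFrame (untyped rows)
val peopleDS = peopleDF.as[Person] // DataFrame       -> Dataset[Person] (typed again)
val backToDF = peopleDS.toDF()     // Dataset[Person] -> DataFrame

// Typed access: fields are resolved by name and checked by the compiler.
val typedNames = peopleDS.collect().map(_.name)
// Untyped access: values are read by position from a generic Row.
val untypedNames = backToDF.collect().map(_.getString(0))
```

Both arrays hold the same names; the difference lies only in whether the compiler or the runtime checks the field access.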
> val ds = peopleDS.filter(_.age > 37)
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
> ds.collect.foreach(println(_))
Person(Sven,38)
Person(Florian,39)
> ds.queryExecution.analyzed
res294: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
MapPartitions <function1>, class[name[0]: string, age[0]: int], class[name[0]: string, age[0]: int], [name#11044,age#11045]
+- LogicalRDD [name#11041,age#11042], MapPartitionsRDD[5894] at rddToDatasetHolder at <console>:97
> ds.queryExecution.optimizedPlan
res295: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
MapPartitions <function1>, class[name[0]: string, age[0]: int], class[name[0]: string, age[0]: int], [name#11044,age#11045]
+- LogicalRDD [name#11041,age#11042], MapPartitionsRDD[5894] at rddToDatasetHolder at <console>:97
> ds.collect.foreach(println(_))
Person(Sven,38)
Person(Florian,39)
Spark SQL
> // Get SQL context from Spark context
// NB: "In Databricks, developers should utilize the shared HiveContext instead of creating one using the constructor. In Scala and Python notebooks, the shared context can be accessed as sqlContext. When running a job, you can access the shared context by calling SQLContext.getOrCreate(SparkContext.getOrCreate())."
import org.apache.spark.sql.SQLContext
val sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())
import org.apache.spark.sql.SQLContext
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@50ec5129
Register DataFrame for usage via SQL
> peopleDF.registerTempTable("sparkPeopleTbl")
The results of SQL queries are DataFrames and support all the usual RDD operations.
> sqlContext.sql("SELECT * FROM sparkPeopleTbl WHERE age > 37")
res298: org.apache.spark.sql.DataFrame = [name: string, age: int]
Print execution plan
> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").queryExecution.analyzed
res299: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [name#11037,age#11038]
+- Filter (age#11038 > 37)
   +- Subquery sparkpeopletbl
      +- LogicalRDD [name#11037,age#11038], MapPartitionsRDD[5887] at rddToDataFrameHolder at <console>:97
Execution plan after optimization by the built-in optimizer
> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").queryExecution.optimizedPlan
res300: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [name#11037,age#11038]
+- Filter (age#11038 > 37)
   +- LogicalRDD [name#11037,age#11038], MapPartitionsRDD[5887] at rddToDataFrameHolder at <console>:97
> %sql SELECT * FROM sparkPeopleTbl WHERE age > 37
name     age
Sven     38
Florian  39
> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").collect
res301: Array[org.apache.spark.sql.Row] = Array([Sven,38], [Florian,39])
> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").map(row => "NAME: " + row(0)).collect
res302: Array[String] = Array(NAME: Sven, NAME: Florian)
> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").rdd.map(row => "NAME: " + row(0)).collect()
res303: Array[String] = Array(NAME: Sven, NAME: Florian)
> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37")
res304: org.apache.spark.sql.DataFrame = [name: string, age: int]
Running SQL queries against Parquet files directly
> ls3("s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/")
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_SUCCESS [0.00 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_common_metadata [0.00 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_metadata [0.03 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00000-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.92 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00001-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.87 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00002-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.84 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00003-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.92 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00004-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00005-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.87 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00006-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00007-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00008-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00009-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.95 MiB]
> %sql SELECT * FROM parquet.`s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/` WHERE trip_time_in_secs <= 60
medallion                         hack_license                      vendor_id  rate_code  store_and_fwd_flag  pickup_datetime
0E0157EB3E6F927FE7FA86C3A0762B3B  4E0569072023608FFEB72F454CCF408B  VTS        1                              2013-01-13T11:08:00.000+0000
19BF1BB516C4E992EA3FBAEDA73D6262  E4CAC9101BFE631554B57906364761D3  VTS        2                              2013-01-13T10:33:00.000+0000
D57C7392455C38D9404660F7BC63D1B6  EC5837D805127379D72FF6C35279890B  VTS        1                              2013-01-13T04:58:00.000+0000
67108EDF8123623806A1DAFE8811EE63  36C9437BD2FF31940BEBED44DDDDDB8A  VTS        5                              2013-01-13T04:58:00.000+0000
5F78CC6D4ECD0541B765FECE17075B6F  39703F5449DADC0AEFFAFEFB1A6A7E64  VTS        1                              2013-01-13T08:56:00.000+0000
Showing the first 1000 rows.
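Querying Parquet files directly can be tried without access to the S3 bucket above. The sketch below is an assumption-laden stand-in, not the notebook's code: it writes the small people data set to a hypothetical temporary local directory and then reads it back both through the `parquet.` SQL prefix and through the `read.parquet` API. It assumes Spark 1.6 or later running in local mode.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

// Reuses the running contexts if present; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("parquet-sketch"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

// Hypothetical scratch location; the "people" subdirectory must not exist yet.
val path = Files.createTempDirectory("people-parquet").resolve("people").toString

sc.parallelize(Seq(
  Person("Lars", 37), Person("Sven", 38), Person("Florian", 39), Person("Dieter", 37)))
  .toDF()
  .write.parquet(path)

// SQL directly against the files, without registering a table first.
val viaSql = sqlContext.sql(s"SELECT name FROM parquet.`$path` WHERE age > 37")
// The equivalent query through the DataFrame reader API.
val viaApi = sqlContext.read.parquet(path).filter("age > 37").select("name")

val sqlNames = viaSql.collect().map(_.getString(0)).sorted
val apiNames = viaApi.collect().map(_.getString(0)).sorted
```

Both variants go through the same optimizer, so choosing between them is mostly a matter of taste.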
Conclusions
RDDs
RDDs remain the core component for the native distributed collections. But due to the lack of built-in optimization, DataFrames and Datasets should be preferred.
DataFrames
DataFrames and Spark SQL are very flexible and bring built-in optimization to dynamic languages like Python and R as well. Besides this, they allow combining the declarative and the functional way of working with structured data. The DataFrame API is regarded as the most stable and flexible API.
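As an illustration of mixing the declarative and the functional style, the following sketch runs the same query through column expressions and through SQL, then post-processes the result with an ordinary Scala function. It is an illustrative example under the assumption of a Spark 1.6-style session; the names `declarative`, `viaSql`, `shouted` and `averageAge` are not from the notebook.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg

case class Person(name: String, age: Int)

// Reuses the running contexts if present; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("dataframe-sketch"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

val peopleDF = sc.parallelize(Seq(
  Person("Lars", 37), Person("Sven", 38), Person("Florian", 39), Person("Dieter", 37))).toDF()

// Spark 1.6 API as used in this notebook; Spark 2.x and later prefer createOrReplaceTempView.
peopleDF.registerTempTable("sparkPeopleTbl")

// Declarative: column expressions that the optimizer can inspect.
val declarative = peopleDF.filter($"age" > 37).select("name")
// The same query expressed in SQL.
val viaSql = sqlContext.sql("SELECT name FROM sparkPeopleTbl WHERE age > 37")

// Functional: drop down to the RDD of Rows and apply an arbitrary Scala function.
val shouted = viaSql.rdd.map(row => row.getString(0).toUpperCase).collect()

// Built-in aggregate functions stay inside the optimized plan.
val averageAge = peopleDF.agg(avg("age")).first().getDouble(0)
```

Going through `.rdd` before `map` keeps the snippet independent of whether `map` on a DataFrame returns an RDD (Spark 1.x) or a Dataset (Spark 2.x and later).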
Datasets
Datasets unify the best of both worlds: type safety from RDDs and the built-in optimization available for DataFrames. But the API is still in an experimental phase and can only be used with Scala and Java. Datasets allow even further optimizations (memory compaction + faster serialization using encoders). Datasets are not yet mature, but they are expected to quickly become a common way to work with structured data, and Spark plans to converge the APIs even further.
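The type-safety argument can be illustrated with a deliberately misspelled field. This is a sketch under the assumption of a Spark 1.6-style session, not code from the notebook; `badColumn` and `adults` are illustrative names, and the non-compiling line is left as a comment so the snippet stays runnable.

```scala
import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

// Reuses the running contexts if present; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("dataset-sketch"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

val peopleRDD = sc.parallelize(Seq(
  Person("Lars", 37), Person("Sven", 38), Person("Florian", 39), Person("Dieter", 37)))
val peopleDF = peopleRDD.toDF()
val peopleDS = peopleRDD.toDS()

// DataFrame: the misspelled column name compiles fine and is only rejected
// when Spark analyzes the query at runtime.
val badColumn = Try(peopleDF.select("agee"))

// Dataset: the same typo is caught by the Scala compiler.
// peopleDS.map(_.agee) // does not compile: value agee is not a member of Person

// Typed lambdas work on Person objects while still producing a Dataset.
val adults = peopleDS.filter(_.age > 37).map(_.name).collect()
```

Here `badColumn` ends up as a `Failure`, whereas the Dataset variant of the typo never makes it past compilation.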
Here are some benchmarks of Datasets:
Transformation of Data Types in Spark