Introduction to Dataset API
Overcoming limitations of DataFrames
https://github.com/shashankgowdal/introduction_to_dataset
● Shashank L
● Big data consultant and trainer at datamantra.io
● www.shashankgowda.com
Agenda
● History of Spark APIs
● Limitations of DataFrames
● Dataset
● Encoders
● Dataset hierarchy
● Performance
● Roadmap
RDD API (2011)
● Distributed collection of JVM objects
● Immutable and Fault tolerant
● Processing structured and unstructured data
● Functional transformations
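The functional transformation style of the RDD API can be sketched with plain Scala collections, which share the same `map`/`filter`/`reduce` shape. This is a conceptual stand-in only; a real RDD would come from `sc.parallelize(...)` on a SparkContext, and its transformations are lazy and distributed.

```scala
// Conceptual sketch: a Seq stands in for an RDD to show the functional
// transformation style. On an RDD these calls would build a lazy lineage;
// on Seq they run eagerly, but the code shape is identical.
val numbers = Seq(1, 2, 3, 4, 5)

val doubledEvens = numbers
  .filter(_ % 2 == 0)   // keep even values
  .map(_ * 2)           // double each one

val total = doubledEvens.reduce(_ + _)
```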
Limitations of RDD API
● No schema associated
● Optimization must be done on the user’s end
● Reading from multiple sources is difficult
● Combining multiple sources is difficult
DataFrame API (2013)
● Distributed collection of Row objects
● Immutable and Fault tolerant
● Processing structured data
● Optimization from Catalyst optimizer
● Data source API
Limitations of Dataframe
● No compile-time type safety
● Cannot operate on domain objects
● No functional programming API
Compile time safety
val dataframe = sqlContext.read.json("people.json")
dataframe.filter("salary > 1000").show()
Throws a runtime exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'salary' given input columns age, name;
Operating on domain objects
val personRDD = sc.makeRDD(Seq(Person("A",10), Person("B",20)))
//Create RDD[Person]
val personDF = sqlContext.createDataFrame(personRDD)
//Create dataframe from a RDD[Person]
personDF.rdd
//We get back RDD[Row] and not RDD[Person]
Dataset
an extension of the DataFrame API that provides a type-safe, object-oriented programming interface
Dataset API
● Type-safe: Operate on domain objects with compiled lambda functions
● Fast: Code generated encoders for fast serialization
● Interoperable: Easily convert between DataFrame and Dataset without boilerplate code
Dataset API
Encoders
● An Encoder converts a JVM object into a Dataset row
● Code-generated encoders for fast serialization
JVM Object → Encoder → Dataset row
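The idea above can be sketched conceptually in plain Scala: an encoder turns a JVM object into a flat row representation and back. Spark’s real encoders generate bytecode and write Tungsten binary rows; the `toRow`/`fromRow` helpers and the `Array[Any]` “row” here are purely illustrative assumptions, not part of the Spark API.

```scala
// Toy encoder sketch: a JVM object is flattened into a "row" (Array[Any]
// here) and reconstructed from it. Real Spark encoders do this with
// generated code against the Tungsten binary format.
case class Person(name: String, age: Long)

def toRow(p: Person): Array[Any] = Array(p.name, p.age)

def fromRow(row: Array[Any]): Person =
  Person(row(0).asInstanceOf[String], row(1).asInstanceOf[Long])

val row  = toRow(Person("A", 10L))
val back = fromRow(row)   // round-trips to the original object
```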
Compile time safety check
case class Person(name: String, age: Long)

val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]

ds.filter(p => p.age > 25)

ds.filter(p => p.salary > 12500)
// error: value salary is not a member of Person
Operating on domain objects
val personRDD = sc.makeRDD(Seq(Person("A",10), Person("B",20)))
//Create RDD[Person]
val personDS = sqlContext.createDataset(personRDD)
//Create Dataset from a RDD
personDS.rdd
// We get back RDD[Person], not RDD[Row] as with a DataFrame
Functional programming

case class Person(name: String, age: Int)

val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]

// Compute histogram of age by name
val hist = ds.groupBy(_.name).mapGroups {
  case (name, people) =>
    val buckets = new Array[Int](10)
    people.map(_.age).foreach { a => buckets(a / 10) += 1 }
    (name, buckets)
}
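The same grouping-and-bucketing logic can be run without a Spark cluster using plain Scala collections, where the eager collections `groupBy` stands in for `Dataset.groupBy(...).mapGroups(...)`. The sample `people` data is made up for illustration; note that, like the slide’s code, the bucket index `age / 10` assumes ages below 100.

```scala
// Plain-Scala sketch of the histogram above: group people by name,
// then count ages into decade buckets (0-9, 10-19, ...).
case class Person(name: String, age: Int)

val people = Seq(Person("A", 12), Person("A", 27), Person("B", 41))

val hist: Map[String, Array[Int]] =
  people.groupBy(_.name).map { case (name, group) =>
    val buckets = new Array[Int](10)
    group.foreach(p => buckets(p.age / 10) += 1)   // assumes age < 100
    (name, buckets)
  }
```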
Dataset hierarchy
SQL, DataFrame, and Dataset (from Python, R, and Scala/Java) all run on the common Tungsten execution engine.
Hands on
● Creating dataset
  ○ From Collections
  ○ From File
● Comparison with RDD
  ○ Operations
  ○ Distributed Wordcount
● Semistructured data
  ○ Downcast
  ○ Upcast
Performance
● Execution performance
● Caching memory usage
● Serialization performance
Roadmap
● The name “Dataset” itself may change
● Performance optimizations
● Unification of DataFrames with Dataset
● Public API for Encoders
● Support for most of the RDD operators on Dataset
Unification of DataFrames with Dataset
class Dataset[T](
    val sqlContext: SQLContext,
    val queryExecution: QueryExecution)(
    implicit val encoder: Encoder[T])

class DataFrame(
    sqlContext: SQLContext,
    queryExecution: QueryExecution)
  extends Dataset[Row](sqlContext, queryExecution)(new RowEncoder)
References
● https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
● https://issues.apache.org/jira/browse/SPARK-9999
● https://goo.gl/Wqc561 - API design