Introduction to Dataset API: Overcoming limitations of Dataframes

https://github.com/shashankgowdal/introduction_to_dataset



● Shashank L

● Big data consultant and trainer at datamantra.io

● www.shashankgowda.com


Agenda

● History of Spark APIs

● Limitations of Dataframes

● Dataset

● Encoders

● Dataset hierarchy

● Performance

● Roadmap


RDD API (2011)

● Distributed collection of JVM objects

● Immutable and Fault tolerant

● Processing structured and unstructured data

● Functional transformations
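The bullets above can be sketched in a spark-shell session (a hypothetical example; it assumes `sc: SparkContext` is predefined, as in the Spark shell, and the collection and lambdas are illustrative):

```scala
// Sketch only: assumes a spark-shell where `sc` (SparkContext) already exists.
// An RDD is an immutable, fault-tolerant distributed collection of JVM objects.
val numbers = sc.makeRDD(Seq(1, 2, 3, 4, 5))

// Functional transformations each return a new immutable RDD; the recorded
// lineage of transformations is what makes recovery after failure possible.
val evensTimesTen = numbers.filter(_ % 2 == 0).map(_ * 10)

evensTimesTen.collect()   // Array(20, 40)
```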


Limitations of RDD API

● No schema associated

● Optimization must be done on the user’s end

● Reading from multiple sources is difficult

● Combining multiple sources is difficult


DataFrame API (2013)

● Distributed collection of Row objects

● Immutable and Fault tolerant

● Processing structured data

● Optimization from Catalyst optimizer

● Data source API
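A minimal sketch of these points (assuming a Spark 1.x shell with `sqlContext` predefined; "people.json" is the sample file used later in the deck):

```scala
// Sketch only: assumes spark-shell with `sqlContext` in scope and a
// "people.json" file containing name/age records.
val df = sqlContext.read.json("people.json")   // Data source API

// Because rows carry a schema, Catalyst can optimize the whole plan
// (e.g. predicate pushdown, column pruning) before execution.
df.filter("age > 21").select("name").explain()
```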


Limitations of Dataframe

● No compile-time type safety

● Cannot operate on domain objects

● No functional programming API


Compile time safety

val dataframe = sqlContext.read.json("people.json")

dataframe.filter("salary > 1000").show()

Throws a runtime exception:

org.apache.spark.sql.AnalysisException: cannot resolve 'salary' given input columns age, name;


Operating on domain objects

val personRDD = sc.makeRDD(Seq(Person("A",10), Person("B",20)))

//Create RDD[Person]

val personDF = sqlContext.createDataFrame(personRDD)

//Create dataframe from a RDD[Person]

personDF.rdd

//We get back RDD[Row] and not RDD[Person]

Dataset

An extension of the DataFrame API that provides a type-safe, object-oriented programming interface


Dataset API

● Type-safe: Operate on domain objects with compiled lambda functions

● Fast: Code generated encoders for fast serialization

● Interoperable: Easily convert between Dataframe and Dataset without boilerplate code
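The interoperability point can be sketched as follows (Spark 1.6-style; `Person` and "people.json" follow the examples used later in the deck):

```scala
// Sketch only: assumes spark-shell with `sqlContext` and its implicits.
import sqlContext.implicits._

case class Person(name: String, age: Long)

val df = sqlContext.read.json("people.json")
val ds: Dataset[Person] = df.as[Person]   // Dataframe -> Dataset, one call
val back = ds.toDF()                      // Dataset -> Dataframe, one call
```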


Encoders

● An encoder converts a JVM object into a Dataset row

● Code generated encoders for fast serialization

JVM Object → Encoder → Dataset row
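As a sketch, the implicits that spark-shell brings into scope supply a generated encoder for case classes (Spark 1.6-style; the exact API surface was still evolving at the time):

```scala
// Sketch only: assumes spark-shell; `import sqlContext.implicits._` provides
// a generated Encoder[Person] for the case class.
import sqlContext.implicits._

case class Person(name: String, age: Long)

// toDS() uses the encoder to turn each JVM Person into a Dataset row;
// map(...) decodes rows back into Person objects to run the lambda.
val ds = Seq(Person("A", 10), Person("B", 20)).toDS()
ds.map(_.age + 1).collect()   // Array(11, 21)
```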


Compile time safety check

case class Person(name: String, age: Long)

val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]

ds.filter(p => p.age > 25)

ds.filter(p => p.salary > 12500)
// error: value salary is not a member of Person


Operating on domain objects

val personRDD = sc.makeRDD(Seq(Person("A",10), Person("B",20)))

//Create RDD[Person]

val personDS = sqlContext.createDataset(personRDD)

//Create Dataset from an RDD[Person]

personDS.rdd

//We get back RDD[Person], not the RDD[Row] a Dataframe would give


Functional programming

case class Person(name: String, age: Int)

val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]

// Compute histogram of age by name
val hist = ds.groupBy(_.name).mapGroups({ case (name, people) =>
  val buckets = new Array[Int](10)
  people.map(_.age).foreach { a => buckets(a / 10) += 1 }
  (name, buckets)
})


Dataset hierarchy

Python, R, and Scala/Java all target the SQL / Dataframe (& Dataset) layer, which runs on the Tungsten execution engine.


Hands on

● Creating dataset
○ From Collections
○ From File

● Comparison with RDD
○ Operations
○ Distributed Wordcount

● Semistructured data
○ Downcast
○ Upcast
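A rough sketch of these hands-on items (Spark 1.6 shell; the file names "people.json" and "input.txt" are placeholders):

```scala
// Sketch only: assumes spark-shell with `sqlContext` and its implicits.
import sqlContext.implicits._

// Creating a dataset from a collection
val ds1 = Seq(1, 2, 3).toDS()

// Creating a dataset from a file, downcast to a typed view...
case class Person(name: String, age: Long)
val people = sqlContext.read.json("people.json").as[Person]
// ...and upcast back to untyped rows
val rows = people.toDF()

// Distributed word count, Dataset style
val counts = sqlContext.read.text("input.txt").as[String]
  .flatMap(_.split(" "))
  .groupBy(w => w)
  .mapGroups { (word, occurrences) => (word, occurrences.length) }
```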


Execution performance


Caching memory usage


Serialization performance


Roadmap

● Even the name Dataset itself may change

● Performance optimizations

● Unification of DataFrames with Dataset

● Public API for Encoders

● Support for most of the RDD operators on Dataset


Unification of DataFrames with Dataset

class Dataset[T](
    val sqlContext: SQLContext,
    val queryExecution: QueryExecution)(
    implicit val encoder: Encoder[T])

class DataFrame(
    sqlContext: SQLContext,
    queryExecution: QueryExecution)
  extends Dataset[Row](sqlContext, queryExecution)(new RowEncoder)